Internal change
PiperOrigin-RevId: 271275031
Change-Id: I69bce2b27644a3fff7bc445c567c8fab4a8ff234
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..baf0444
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,459 @@
+ GNU LESSER GENERAL PUBLIC LICENSE
+ Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL. It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+ This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it. You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+ When we speak of free software, we are referring to freedom of use,
+not price. Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+ To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights. These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+ For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you. You must make sure that they, too, receive or can get the source
+code. If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it. And you must show them these terms so they know their rights.
+
+ We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+ To protect each distributor, we want to make it very clear that
+there is no warranty for the free library. Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+ Finally, software patents pose a constant threat to the existence of
+any free program. We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder. Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+ Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License. This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License. We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+ When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library. The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom. The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+ We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License. It also provides other free software developers Less
+of an advantage over competing non-free programs. These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries. However, the Lesser license provides advantages in certain
+special circumstances.
+
+ For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard. To achieve this, non-free programs must be
+allowed to use the library. A more frequent case is that a free
+library does the same job as widely used non-free libraries. In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+ In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software. For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+ Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+ The precise terms and conditions for copying, distribution and
+modification follow. Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library". The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+ GNU LESSER GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+ A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+ The "Library", below, refers to any such software library or work
+which has been distributed under these terms. A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language. (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+ "Source code" for a work means the preferred form of the work for
+making modifications to it. For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+ Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it). Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+
+ 1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+ You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+ 2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) The modified work must itself be a software library.
+
+ b) You must cause the files modified to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ c) You must cause the whole of the work to be licensed at no
+ charge to all third parties under the terms of this License.
+
+ d) If a facility in the modified Library refers to a function or a
+ table of data to be supplied by an application program that uses
+ the facility, other than as an argument passed when the facility
+ is invoked, then you must make a good faith effort to ensure that,
+ in the event an application does not supply such function or
+ table, the facility still operates, and performs whatever part of
+ its purpose remains meaningful.
+
+ (For example, a function in a library to compute square roots has
+ a purpose that is entirely well-defined independent of the
+ application. Therefore, Subsection 2d requires that any
+ application-supplied function or table used by this function must
+ be optional: if the application does not supply it, the square
+ root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library. To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License. (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.) Do not make any other change in
+these notices.
+
+ Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+ This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+ 4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+ If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library". Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+ However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library". The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+ When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library. The
+threshold for this to be true is not precisely defined by law.
+
+ If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work. (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+ Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+ 6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+ You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License. You must supply a copy of this License. If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License. Also, you must do one
+of these things:
+
+ a) Accompany the work with the complete corresponding
+ machine-readable source code for the Library including whatever
+ changes were used in the work (which must be distributed under
+ Sections 1 and 2 above); and, if the work is an executable linked
+ with the Library, with the complete machine-readable "work that
+ uses the Library", as object code and/or source code, so that the
+ user can modify the Library and then relink to produce a modified
+ executable containing the modified Library. (It is understood
+ that the user who changes the contents of definitions files in the
+ Library will not necessarily be able to recompile the application
+ to use the modified definitions.)
+
+ b) Use a suitable shared library mechanism for linking with the
+ Library. A suitable mechanism is one that (1) uses at run time a
+ copy of the library already present on the user's computer system,
+ rather than copying library functions into the executable, and (2)
+ will operate properly with a modified version of the library, if
+ the user installs one, as long as the modified version is
+ interface-compatible with the version that the work was made with.
+
+ c) Accompany the work with a written offer, valid for at
+ least three years, to give the same user the materials
+ specified in Subsection 6a, above, for a charge no more
+ than the cost of performing this distribution.
+
+ d) If distribution of the work is made by offering access to copy
+ from a designated place, offer equivalent access to copy the above
+ specified materials from the same place.
+
+ e) Verify that the user has already received a copy of these
+ materials or that you have already sent this user a copy.
+
+ For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it. However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+ It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system. Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+ 7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+ a) Accompany the combined library with a copy of the same work
+ based on the Library, uncombined with any other library
+ facilities. This must be distributed under the terms of the
+ Sections above.
+
+ b) Give prominent notice with the combined library of the fact
+ that part of it is a work based on the Library, and explaining
+ where to find the accompanying uncombined form of the same work.
+
+ 8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License. Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License. However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+ 9. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Library or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+ 10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+ 11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all. For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded. In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+ 13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation. If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+ 14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission. For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this. Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+ NO WARRANTY
+
+ 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
diff --git a/Makefile.gbase b/Makefile.gbase
new file mode 100644
index 0000000..ad03d36
--- /dev/null
+++ b/Makefile.gbase
@@ -0,0 +1,248 @@
+#
+# Copyright (C) 2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+
+# Makefile for libacml_mv library
+
+# What we're building, and where to find it.
+LIBRARY = libacml_mv.a
+
+TARGETS = $(LIBRARY)
+
+# Makefile setup
+include $(COMMONDEFS)
+
+VPATH = $(BUILD_BASE)/src:$(BUILD_BASE)/src/gas
+
+# Compiler options
+LCOPTS = $(STD_COMPILE_OPTS) $(STD_C_OPTS)
+LCDEFS = $(HOSTDEFS) $(TARGDEFS)
+LCINCS = -I$(BUILD_BASE)/inc
+
+# CFLAGS += -Wall -W -Wstrict-prototypes -Werror -fPIC -O2 $(DEBUG)
+
+ifeq ($(BUILD_ARCH), X8664)
+
+CFILES = \
+ acos.c \
+ acosf.c \
+ acosh.c \
+ acoshf.c \
+ asin.c \
+ asinf.c \
+ asinh.c \
+ asinhf.c \
+ atan2.c \
+ atan2f.c \
+ atan.c \
+ atanf.c \
+ atanh.c \
+ atanhf.c \
+ ceil.c \
+ ceilf.c \
+ cosh.c \
+ coshf.c \
+ exp_special.c \
+ finite.c \
+ finitef.c \
+ floor.c \
+ floorf.c \
+ frexp.c \
+ frexpf.c \
+ hypot.c \
+ hypotf.c \
+ ilogb.c \
+ ilogbf.c \
+ ldexp.c \
+ ldexpf.c \
+ libm_special.c \
+ llrint.c \
+ llrintf.c \
+ llround.c \
+ llroundf.c \
+ log1p.c \
+ log1pf.c \
+ logb.c \
+ logbf.c \
+ log_special.c \
+ lrint.c \
+ lrintf.c \
+ lround.c \
+ lroundf.c \
+ modf.c \
+ modff.c \
+ nan.c \
+ nanf.c \
+ nearbyintf.c \
+ nextafter.c \
+ nextafterf.c \
+ nexttoward.c \
+ nexttowardf.c \
+ pow_special.c \
+ remainder_piby2.c \
+ remainder_piby2d2f.c \
+ rint.c \
+ rintf.c \
+ roundf.c \
+ scalbln.c \
+ scalblnf.c \
+ scalbn.c \
+ scalbnf.c \
+ sincos_special.c \
+ sinh.c \
+ sinhf.c \
+ sqrt.c \
+ sqrtf.c \
+ tan.c \
+ tanf.c \
+ tanh.c \
+ tanhf.c
+
+ASFILES = \
+ cbrtf.S \
+ cbrt.S \
+ copysignf.S \
+ copysign.S \
+ cosf.S \
+ cos.S \
+ exp10f.S \
+ exp10.S \
+ exp2f.S \
+ exp2.S \
+ expf.S \
+ expm1f.S \
+ expm1.S \
+ exp.S \
+ fabsf.S \
+ fabs.S \
+ fdimf.S \
+ fdim.S \
+ fmaxf.S \
+ fmax.S \
+ fminf.S \
+ fmin.S \
+ fmodf.S \
+ fmod.S \
+ log10f.S \
+ log10.S \
+ log2f.S \
+ log2.S \
+ logf.S \
+ log.S \
+ nearbyint.S \
+ powf.S \
+ pow.S \
+ remainderf.S \
+ remainder.S \
+ round.S \
+ sincosf.S \
+ sincos.S \
+ sinf.S \
+ sin.S \
+ truncf.S \
+ trunc.S \
+ v4hcosl.S \
+ v4helpl.S \
+ v4hfrcpal.S \
+ v4hlog10l.S \
+ v4hlog2l.S \
+ v4hlogl.S \
+ v4hsinl.S \
+ vrd2cos.S \
+ vrd2exp.S \
+ vrd2log10.S \
+ vrd2log2.S \
+ vrd2log.S \
+ vrd2sincos.S \
+ vrd2sin.S \
+ vrd4cos.S \
+ vrd4exp.S \
+ vrd4frcpa.S \
+ vrd4log10.S \
+ vrd4log2.S \
+ vrd4log.S \
+ vrd4sin.S \
+ vrdacos.S \
+ vrdaexp.S \
+ vrdalog10.S \
+ vrdalog2.S \
+ vrdalogr.S \
+ vrdalog.S \
+ vrda_scaled_logr.S \
+ vrda_scaledshifted_logr.S \
+ vrdasincos.S \
+ vrdasin.S \
+ vrs4cosf.S \
+ vrs4expf.S \
+ vrs4log10f.S \
+ vrs4log2f.S \
+ vrs4logf.S \
+ vrs4powf.S \
+ vrs4powxf.S \
+ vrs4sincosf.S \
+ vrs4sinf.S \
+ vrs8expf.S \
+ vrs8log10f.S \
+ vrs8log2f.S \
+ vrs8logf.S \
+ vrsacosf.S \
+ vrsaexpf.S \
+ vrsalog10f.S \
+ vrsalog2f.S \
+ vrsalogf.S \
+ vrsapowf.S \
+ vrsapowxf.S \
+ vrsasincosf.S \
+ vrsasinf.S
+
+else
+
+# The special processing of the -lm option in the compiler driver should
+# be delayed until all of the options have been parsed. Until the
+# driver is cleaned up, it is important that processing be the same on
+# all architectures. Thus we add an empty 32-bit ACML vector math
+# library.
+
+dummy.c :
+ echo "void libacml_mv_placeholder() {}" > dummy.c
+
+CFILES = dummy.c
+LDIRT += dummy.c
+
+endif
+
+
+default:
+ $(MAKE) first
+ $(MAKE) $(TARGETS)
+ $(MAKE) last
+
+first :
+ifndef SKIP_DEP_BUILD
+ $(call submake,$(BUILD_AREA)/include)
+endif
+
+last : make_libdeps
+
+include $(COMMONRULES)
+
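+# Archive all compiled objects into the static library and generate its index.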
+$(LIBRARY): $(OBJECTS)
+ $(ar) cru $@ $^
+ $(ranlib) $@
+
diff --git a/acml_trace.cc b/acml_trace.cc
new file mode 100644
index 0000000..b5c967f
--- /dev/null
+++ b/acml_trace.cc
@@ -0,0 +1,86 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <algorithm>
+#include <functional>
+#include <string>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/helpers.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/absl/strings/cord.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+#include "util/task/status.h"
+
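+// Reads the trace file `filename` into a vector, invoking `callback` once per
+// record until the CordReader over the file contents is exhausted.  Trace
+// files hold the raw in-memory bytes of each value, so callbacks decode a
+// record with a plain byte copy.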
+template<typename T>
+std::unique_ptr<std::vector<T>> InitTrace(
+ const char* filename,
+ std::function<T(CordReader* reader)> callback) {
+ std::unique_ptr<std::vector<T>> trace(new std::vector<T>);
+ Cord cord;
+ CHECK_OK(file::GetContents(filename, &cord, file::Defaults()));
+ CordReader reader(cord);
+
+ while (!reader.done()) {
+ trace->push_back(callback(&reader));
+ }
+
+ return trace;
+}
+
+// Read a trace file with doubles.
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename) {
+ std::function<double(CordReader* reader)> read_double =
+ [](CordReader* reader) {
+ double d;
+ CHECK_GE(reader->Available(), sizeof(d));
+ reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+ return d;
+ };
+ std::unique_ptr<std::vector<double>> trace(InitTrace<double>(filename,
+ read_double));
+ return trace;
+}
+
+// Read a trace file with pairs of doubles.
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+ const char *filename) {
+ std::function<std::pair<double, double>(CordReader* reader)> read_double =
+ [](CordReader* reader) {
+ double d[2];
+ CHECK_GE(reader->Available(), sizeof(d));
+ reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+ return std::make_pair(d[0], d[1]);
+ };
+ std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+ InitTrace<std::pair<double, double>>(filename, read_double));
+ return trace;
+}
+
+// Read a trace file with floats.
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename) {
+ std::function<float(CordReader* reader)> read_float =
+ [](CordReader* reader) {
+ float f;
+      const size_t bytes_to_read =
+          std::min<size_t>(sizeof(f), reader->Available());
+ reader->ReadN(bytes_to_read, reinterpret_cast<char*>(&f));
+ return f;
+ };
+ std::unique_ptr<std::vector<float>> trace(InitTrace<float>(filename,
+ read_float));
+ return trace;
+}
diff --git a/acml_trace.h b/acml_trace.h
new file mode 100644
index 0000000..65eda94
--- /dev/null
+++ b/acml_trace.h
@@ -0,0 +1,25 @@
+// Copyright 2012 and onwards Google Inc.
+// Author: martint@google.com (Martin Thuresson)
+
+#ifndef THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+#define THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+
+// Log files gathered from a complete run of rephil/docs. Contains the
+// arguments to all exp/log/pow calls.
+#define BASE_TRACE_PATH "google3/third_party/open64_libacml_mv/testdata/"
+#define EXP_LOGFILE (BASE_TRACE_PATH "/exp.rephil_docs.builtin.baseline.trace")
+#define EXPF_LOGFILE (BASE_TRACE_PATH "/expf.fastmath_unittest.trace")
+#define LOG_LOGFILE (BASE_TRACE_PATH "/log.rephil_docs.builtin.baseline.trace")
+#define POW_LOGFILE (BASE_TRACE_PATH "/pow.rephil_docs.builtin.baseline.trace")
+
+#include <memory>
+#include <vector>
+
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+ const char *filename);
+
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename);
+
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename);
+
+#endif // THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
diff --git a/acml_trace_benchmark.cc b/acml_trace_benchmark.cc
new file mode 100644
index 0000000..fb6acc4
--- /dev/null
+++ b/acml_trace_benchmark.cc
@@ -0,0 +1,272 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+ InitGoogle(argv[0], &argc, &argv, true);
+ RunSpecifiedBenchmarks();
+ return 0;
+}
+
+namespace {
+
+// Local typedefs to avoid repeating complex types all over the function.
+typedef std::unique_ptr<std::vector<double>> DoubleListPtr;
+typedef std::unique_ptr<std::vector<float>> FloatListPtr;
+typedef std::unique_ptr<std::vector<std::pair<double,
+ double>>> DoublePairListPtr;
+
+/////////////////////////
+// Benchmark log() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_log(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ // Process trace.
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_log().
+static void BM_math_trace_acmllog(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_log(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark log().
+static void BM_math_trace_log(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += log(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark exp() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_exp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_exp().
+static void BM_math_trace_acmlexp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_exp(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark exp().
+static void BM_math_trace_exp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += exp(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+/////////////////////////
+// Benchmark expf() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_expf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_expf().
+static void BM_math_trace_acmlexpf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_expf(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark expf().
+static void BM_math_trace_expf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += expf(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark pow() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_pow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += (*itr).first + (*itr).second;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_pow().
+static void BM_math_trace_acmlpow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += acml_pow((*itr).first,
+ (*itr).second);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark pow().
+static void BM_math_trace_pow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += pow((*itr).first,
+ (*itr).second);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+BENCHMARK(BM_math_trace_read_exp);
+BENCHMARK(BM_math_trace_acmlexp);
+BENCHMARK(BM_math_trace_exp);
+
+BENCHMARK(BM_math_trace_read_log);
+BENCHMARK(BM_math_trace_acmllog);
+BENCHMARK(BM_math_trace_log);
+
+BENCHMARK(BM_math_trace_read_pow);
+BENCHMARK(BM_math_trace_acmlpow);
+BENCHMARK(BM_math_trace_pow);
+
+BENCHMARK(BM_math_trace_read_expf);
+BENCHMARK(BM_math_trace_acmlexpf);
+BENCHMARK(BM_math_trace_expf);
+
+} // namespace
diff --git a/acml_trace_validate_test.cc b/acml_trace_validate_test.cc
new file mode 100644
index 0000000..9bd682c
--- /dev/null
+++ b/acml_trace_validate_test.cc
@@ -0,0 +1,114 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <math.h>
+#include <stdio.h>
+
+#include <cstdlib>
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "testing/base/public/gunit.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+ InitGoogle(argv[0], &argc, &argv, true);
+ RunSpecifiedBenchmarks();
+ return RUN_ALL_TESTS();
+}
+
+
+// Compare two doubles given a maximum unit of least precision (ULP).
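+// The bit patterns of the two doubles are reinterpreted as signed 64-bit
+// integers; for finite IEEE-754 values of the same sign, the magnitude of the
+// integer difference equals the number of representable doubles between A and
+// B, so the check passes when the values are at most maxUlps steps apart.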
+bool AlmostEqualDoubleUlps(double A, double B, int64 maxUlps) {
+ CHECK_EQ(sizeof(A), sizeof(maxUlps));
+ if (A == B)
+ return true;
+ int64 intDiff = std::abs(*(reinterpret_cast<int64*>(&A)) -
+ *(reinterpret_cast<int64*>(&B)));
+ return intDiff <= maxUlps;
+}
+
+// Compare two floats given a maximum unit of least precision (ULP).
+bool AlmostEqualFloatUlps(float A, float B, int32 maxUlps) {
+ CHECK_EQ(sizeof(A), sizeof(maxUlps));
+ if (A == B)
+ return true;
+  int32 intDiff = std::abs(*(reinterpret_cast<int32*>(&A)) -
+                           *(reinterpret_cast<int32*>(&B)));
+ return intDiff <= maxUlps;
+}
+
+TEST(Case, LogTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<double>> trace(
+ GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_log(*iter);
+ d2 = log(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
+
+TEST(Case, ExpTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<double>> trace(
+ GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_exp(*iter);
+ d2 = exp(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
+
+
+TEST(Case, ExpfTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<float>> trace(
+ GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ float f1;
+ float f2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ f1 = acml_expf(*iter);
+ f2 = expf(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualFloatUlps(f1, f2, 1));
+ }
+}
+
+
+TEST(Case, PowTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+ GetTraceDoublePair(file::JoinPath(FLAGS_test_srcdir,
+ POW_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_pow((*iter).first,
+ (*iter).second);
+ d2 = pow((*iter).first,
+ (*iter).second);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
diff --git a/inc/acml_mv.h b/inc/acml_mv.h
new file mode 100644
index 0000000..49b7feb
--- /dev/null
+++ b/inc/acml_mv.h
@@ -0,0 +1,81 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double,double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double , double *, double *);
+
+float fastexpf(float );
+float fastlogf(float );
+float fastlog10f(float );
+float fastlog2f(float );
+float fastpowf(float,float);
+float fastcosf(float );
+float fastsinf(float );
+void fastsincosf(float, float *,float *);
+
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
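+/*
+** Illustrative usage (not part of the library): the array routines take an
+** element count followed by input and output arrays, e.g.
+**
+**   double x[4] = {0.0, 1.0, 2.0, 3.0};
+**   double y[4];
+**   vrda_exp(4, x, y);    (fills y[i] with exp(x[i]))
+*/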
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/inc/acml_mv_m128.h b/inc/acml_mv_m128.h
new file mode 100644
index 0000000..c783fe3
--- /dev/null
+++ b/inc/acml_mv_m128.h
@@ -0,0 +1,103 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double,double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double , double *, double *);
+
+float fastexpf(float );
+float fastlogf(float );
+float fastlog10f(float );
+float fastlog2f(float );
+float fastpowf(float,float);
+float fastcosf(float );
+float fastsinf(float );
+void fastsincosf(float, float *,float *);
+
+/*
+** The single vector routines.
+*/
+__m128d __vrd2_log(__m128d);
+__m128d __vrd2_exp(__m128d);
+__m128d __vrd2_log10(__m128d);
+__m128d __vrd2_log2(__m128d);
+__m128d __vrd2_sin(__m128d);
+__m128d __vrd2_cos(__m128d);
+void __vrd2_sincos(__m128d, __m128d *, __m128d *);
+
+__m128 __vrs4_expf(__m128);
+__m128 __vrs4_logf(__m128);
+__m128 __vrs4_log10f(__m128);
+__m128 __vrs4_log2f(__m128);
+__m128 __vrs4_powf(__m128,__m128);
+__m128 __vrs4_powxf(__m128 x,float y);
+__m128 __vrs4_sinf(__m128);
+__m128 __vrs4_cosf(__m128);
+void __vrs4_sincosf(__m128, __m128 *, __m128 *);
+
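+/*
+** Illustrative usage (not part of the library): the packed-vector entry
+** points operate on one SSE register of values at a time, e.g.
+**
+**   __m128 x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
+**   __m128 y = __vrs4_expf(x);    (four packed single-precision exp results)
+*/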
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
+
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/inc/fn_macros.h b/inc/fn_macros.h
new file mode 100644
index 0000000..afc2f59
--- /dev/null
+++ b/inc/fn_macros.h
@@ -0,0 +1,47 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef __FN_MACROS_H__
+#define __FN_MACROS_H__
+
+#if defined(WINDOWS)
+#pragma warning( disable : 4985 )
+#define FN_PROTOTYPE(fn_name) acml_impl_##fn_name
+#else
+/* For Linux, implementation function names are given a prefix: historically
+   a double underscore, currently acml_impl_. */
+#define ACML_CONCAT(x,y) x##y
+/* #define FN_PROTOTYPE(fn_name) concat(__,fn_name) */
+/* The acml_impl_ prefix is used instead of the commented-out definition above
+   so that the build succeeds.  !!!!! REVISIT THIS SOON !!!!! */
+#define FN_PROTOTYPE(fn_name) ACML_CONCAT(acml_impl_,fn_name)
+#endif
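+/* For example, FN_PROTOTYPE(exp) expands to acml_impl_exp, so each
+   implementation in this library is compiled under an acml_impl_ prefix. */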
+
+
+#if defined(WINDOWS)
+#define weak_alias(name, aliasname) /* as nothing */
+#else
+/* Define ALIASNAME as a weak alias for NAME.
+ If weak aliases are not available, this defines a strong alias. */
+#define weak_alias(name, aliasname) /* _weak_alias (name, aliasname) */ /* !!!!! REVISIT THIS SOON !!!!! */
+#define _weak_alias(name, aliasname) extern __typeof (name) aliasname __attribute__ ((weak, alias (#name)));
+#endif
+
+#endif // __FN_MACROS_H__
diff --git a/inc/libm_amd.h b/inc/libm_amd.h
new file mode 100644
index 0000000..66cd46c
--- /dev/null
+++ b/inc/libm_amd.h
@@ -0,0 +1,225 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_AMD_H_INCLUDED
+#define LIBM_AMD_H_INCLUDED 1
+
+#include <emmintrin.h>
+#include "acml_mv.h"
+#include "acml_mv_m128.h"
+
+#include "fn_macros.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+ double FN_PROTOTYPE(cbrt)(double x);
+ float FN_PROTOTYPE(cbrtf)(float x);
+
+ double FN_PROTOTYPE(fabs)(double x);
+ float FN_PROTOTYPE(fabsf)(float x);
+
+ double FN_PROTOTYPE(acos)(double x);
+ float FN_PROTOTYPE(acosf)(float x);
+
+ double FN_PROTOTYPE(acosh)(double x);
+ float FN_PROTOTYPE(acoshf)(float x);
+
+ double FN_PROTOTYPE(asin)(double x);
+ float FN_PROTOTYPE(asinf)(float x);
+
+ double FN_PROTOTYPE( asinh)(double x);
+ float FN_PROTOTYPE(asinhf)(float x);
+
+ double FN_PROTOTYPE( atan)(double x);
+ float FN_PROTOTYPE(atanf)(float x);
+
+ double FN_PROTOTYPE( atanh)(double x);
+ float FN_PROTOTYPE(atanhf)(float x);
+
+ double FN_PROTOTYPE( atan2)(double x, double y);
+ float FN_PROTOTYPE(atan2f)(float x, float y);
+
+ double FN_PROTOTYPE( ceil)(double x);
+ float FN_PROTOTYPE(ceilf)(float x);
+
+
+ double FN_PROTOTYPE( cos)(double x);
+ float FN_PROTOTYPE(cosf)(float x);
+
+ double FN_PROTOTYPE( cosh)(double x);
+ float FN_PROTOTYPE(coshf)(float x);
+
+ double FN_PROTOTYPE( exp)(double x);
+ float FN_PROTOTYPE(expf)(float x);
+
+ double FN_PROTOTYPE( expm1)(double x);
+ float FN_PROTOTYPE(expm1f)(float x);
+
+ double FN_PROTOTYPE( exp2)(double x);
+ float FN_PROTOTYPE(exp2f)(float x);
+
+ double FN_PROTOTYPE( exp10)(double x);
+ float FN_PROTOTYPE(exp10f)(float x);
+
+
+ double FN_PROTOTYPE( fdim)(double x, double y);
+ float FN_PROTOTYPE(fdimf)(float x, float y);
+
+#ifdef WINDOWS
+ int FN_PROTOTYPE(finite)(double x);
+ int FN_PROTOTYPE(finitef)(float x);
+#else
+ int FN_PROTOTYPE(finite)(double x);
+ int FN_PROTOTYPE(finitef)(float x);
+#endif
+
+ double FN_PROTOTYPE( floor)(double x);
+ float FN_PROTOTYPE(floorf)(float x);
+
+ double FN_PROTOTYPE( fmax)(double x, double y);
+ float FN_PROTOTYPE(fmaxf)(float x, float y);
+
+ double FN_PROTOTYPE( fmin)(double x, double y);
+ float FN_PROTOTYPE(fminf)(float x, float y);
+
+ double FN_PROTOTYPE( fmod)(double x, double y);
+ float FN_PROTOTYPE(fmodf)(float x, float y);
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE( hypot)(double x, double y);
+ float FN_PROTOTYPE(hypotf)(float x, float y);
+#else
+ double FN_PROTOTYPE( hypot)(double x, double y);
+ float FN_PROTOTYPE(hypotf)(float x, float y);
+#endif
+
+ float FN_PROTOTYPE(ldexpf)(float x, int exp);
+
+ double FN_PROTOTYPE(ldexp)(double x, int exp);
+
+ double FN_PROTOTYPE( log)(double x);
+ float FN_PROTOTYPE(logf)(float x);
+
+
+ float FN_PROTOTYPE(log2f)(float x);
+
+ double FN_PROTOTYPE( log10)(double x);
+ float FN_PROTOTYPE(log10f)(float x);
+
+
+ float FN_PROTOTYPE(log1pf)(float x);
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE( logb)(double x);
+ float FN_PROTOTYPE(logbf)(float x);
+#else
+ double FN_PROTOTYPE( logb)(double x);
+ float FN_PROTOTYPE(logbf)(float x);
+#endif
+
+ double FN_PROTOTYPE( modf)(double x, double *iptr);
+ float FN_PROTOTYPE(modff)(float x, float *iptr);
+
+ double FN_PROTOTYPE( nextafter)(double x, double y);
+ float FN_PROTOTYPE(nextafterf)(float x, float y);
+
+ double FN_PROTOTYPE( pow)(double x, double y);
+ float FN_PROTOTYPE(powf)(float x, float y);
+
+ double FN_PROTOTYPE( remainder)(double x, double y);
+ float FN_PROTOTYPE(remainderf)(float x, float y);
+
+ double FN_PROTOTYPE(sin)(double x);
+ float FN_PROTOTYPE(sinf)(float x);
+
+ void FN_PROTOTYPE(sincos)(double x, double *s, double *c);
+ void FN_PROTOTYPE(sincosf)(float x, float *s, float *c);
+
+ double FN_PROTOTYPE( sinh)(double x);
+ float FN_PROTOTYPE(sinhf)(float x);
+
+ double FN_PROTOTYPE( sqrt)(double x);
+ float FN_PROTOTYPE(sqrtf)(float x);
+
+ double FN_PROTOTYPE( tan)(double x);
+ float FN_PROTOTYPE(tanf)(float x);
+
+ double FN_PROTOTYPE( tanh)(double x);
+ float FN_PROTOTYPE(tanhf)(float x);
+
+ double FN_PROTOTYPE( trunc)(double x);
+ float FN_PROTOTYPE(truncf)(float x);
+
+ double FN_PROTOTYPE( log1p)(double x);
+ double FN_PROTOTYPE( log2)(double x);
+
+ double FN_PROTOTYPE(cosh)(double x);
+ float FN_PROTOTYPE(coshf)(float fx);
+
+ double FN_PROTOTYPE(frexp)(double value, int *exp);
+ float FN_PROTOTYPE(frexpf)(float value, int *exp);
+ int FN_PROTOTYPE(ilogb)(double x);
+ int FN_PROTOTYPE(ilogbf)(float x);
+
+ long long int FN_PROTOTYPE(llrint)(double x);
+ long long int FN_PROTOTYPE(llrintf)(float x);
+ long int FN_PROTOTYPE(lrint)(double x);
+ long int FN_PROTOTYPE(lrintf)(float x);
+ long int FN_PROTOTYPE(lround)(double d);
+ long int FN_PROTOTYPE(lroundf)(float f);
+ double FN_PROTOTYPE(nan)(const char *tagp);
+ float FN_PROTOTYPE(nanf)(const char *tagp);
+ float FN_PROTOTYPE(nearbyintf)(float x);
+ double FN_PROTOTYPE(nearbyint)(double x);
+ double FN_PROTOTYPE(nextafter)(double x, double y);
+ float FN_PROTOTYPE(nextafterf)(float x, float y);
+ double FN_PROTOTYPE(nexttoward)(double x, long double y);
+ float FN_PROTOTYPE(nexttowardf)(float x, long double y);
+ double FN_PROTOTYPE(rint)(double x);
+ float FN_PROTOTYPE(rintf)(float x);
+ float FN_PROTOTYPE(roundf)(float f);
+ double FN_PROTOTYPE(round)(double f);
+ double FN_PROTOTYPE(scalbln)(double x, long int n);
+ float FN_PROTOTYPE(scalblnf)(float x, long int n);
+ double FN_PROTOTYPE(scalbn)(double x, int n);
+ float FN_PROTOTYPE(scalbnf)(float x, int n);
+ long long int FN_PROTOTYPE(llroundf)(float f);
+ long long int FN_PROTOTYPE(llround)(double d);
+
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE(copysign)(double x, double y);
+ float FN_PROTOTYPE(copysignf)(float x, float y);
+#else
+ double FN_PROTOTYPE(copysign)(double x, double y);
+ float FN_PROTOTYPE(copysignf)(float x, float y);
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* LIBM_AMD_H_INCLUDED */
diff --git a/inc/libm_errno_amd.h b/inc/libm_errno_amd.h
new file mode 100644
index 0000000..1e6b8b9
--- /dev/null
+++ b/inc/libm_errno_amd.h
@@ -0,0 +1,33 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_ERRNO_AMD_H_INCLUDED
+#define LIBM_ERRNO_AMD_H_INCLUDED 1
+
+#include <stdio.h>
+#include <errno.h>
+#ifndef __set_errno
+#define __set_errno(x) errno = (x)
+#endif
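+
+/* Illustrative usage (editorial sketch, not part of the original header):
+   a routine that detects a domain error, e.g. the log of a negative
+   number, might report it as
+
+     __set_errno(EDOM);
+
+   EDOM is the standard domain-error code from <errno.h>. */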
+
+#endif /* LIBM_ERRNO_AMD_H_INCLUDED */
diff --git a/inc/libm_inlines_amd.h b/inc/libm_inlines_amd.h
new file mode 100644
index 0000000..a2e387a
--- /dev/null
+++ b/inc/libm_inlines_amd.h
@@ -0,0 +1,2188 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_INLINES_AMD_H_INCLUDED
+#define LIBM_INLINES_AMD_H_INCLUDED 1
+
+#include "libm_util_amd.h"
+#include <math.h>
+
+#ifdef WINDOWS
+#define inline __inline
+#include "emmintrin.h"
+#endif
+
+/* Compile-time verification that type long long is the same size
+   as type double (i.e. we are really on a 64-bit machine) */
+void check_long_against_double_size(int machine_is_64_bit[(sizeof(long long) == sizeof(double))?1:-1]);
+
+/* Set defines for inline functions calling other inlines */
+#if defined(USE_VAL_WITH_FLAGS) || defined(USE_VALF_WITH_FLAGS) || \
+ defined(USE_ZERO_WITH_FLAGS) || defined(USE_ZEROF_WITH_FLAGS) || \
+ defined(USE_NAN_WITH_FLAGS) || defined(USE_NANF_WITH_FLAGS) || \
+ defined(USE_INDEFINITE_WITH_FLAGS) || defined(USE_INDEFINITEF_WITH_FLAGS) || \
+ defined(USE_INFINITY_WITH_FLAGS) || defined(USE_INFINITYF_WITH_FLAGS) || \
+ defined(USE_SQRT_AMD_INLINE) || defined(USE_SQRTF_AMD_INLINE) || \
+ (defined(WINDOWS) && (defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF)))
+#undef USE_RAISE_FPSW_FLAGS
+#define USE_RAISE_FPSW_FLAGS 1
+#endif
+
+#if defined(USE_SPLITDOUBLE)
+/* Splits double x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+ are not checked */
+static inline void splitDouble(double x, int *e, double *m)
+{
+ unsigned long long ux, uy;
+ GET_BITS_DP64(x, ux);
+ uy = ux;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ *e = (int)ux - EXPBIAS_DP64 + 1;
+ uy = (uy & (SIGNBIT_DP64 | MANTBITS_DP64)) | HALFEXPBITS_DP64;
+ PUT_BITS_DP64(uy, x);
+ *m = x;
+}
+#endif /* USE_SPLITDOUBLE */
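+
+/* Worked example (editorial note, not part of the original source):
+   for x = 6.0 the biased exponent field is 1025, so splitDouble sets
+   e = 1025 - 1023 + 1 = 3 and replaces the exponent with that of 0.5,
+   giving m = 0.75; indeed 6.0 == 0.75 * 2**3 with 0.5 <= |m| < 1.0.
+
+     int e; double m;
+     splitDouble(6.0, &e, &m);   // e == 3, m == 0.75
+*/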
+
+
+#if defined(USE_SPLITDOUBLE_2)
+/* Splits double x into exponent e and mantissa m, where 1.0 <= abs(m) < 4.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+   are not checked. Also assumes EXPBIAS_DP64 is odd. With this
+ assumption, e will be even on exit. */
+static inline void splitDouble_2(double x, int *e, double *m)
+{
+ unsigned long long ux, vx;
+ GET_BITS_DP64(x, ux);
+ vx = ux;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ if (ux & 1)
+ {
+ /* The exponent is odd */
+ vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | ONEEXPBITS_DP64;
+ PUT_BITS_DP64(vx, x);
+ *m = x;
+ *e = ux - EXPBIAS_DP64;
+ }
+ else
+ {
+ /* The exponent is even */
+ vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | TWOEXPBITS_DP64;
+ PUT_BITS_DP64(vx, x);
+ *m = x;
+ *e = ux - EXPBIAS_DP64 - 1;
+ }
+}
+#endif /* USE_SPLITDOUBLE_2 */
+
+
+#if defined(USE_SPLITFLOAT)
+/* Splits float x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+ are not checked */
+static inline void splitFloat(float x, int *e, float *m)
+{
+ unsigned int ux, uy;
+ GET_BITS_SP32(x, ux);
+ uy = ux;
+ ux &= EXPBITS_SP32;
+ ux >>= EXPSHIFTBITS_SP32;
+ *e = (int)ux - EXPBIAS_SP32 + 1;
+ uy = (uy & (SIGNBIT_SP32 | MANTBITS_SP32)) | HALFEXPBITS_SP32;
+ PUT_BITS_SP32(uy, x);
+ *m = x;
+}
+#endif /* USE_SPLITFLOAT */
+
+
+#if defined(USE_SCALEDOUBLE_1)
+/* Scales the double x by 2.0**n.
+ Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline double scaleDouble_1(double x, int n)
+{
+ double t;
+ /* Construct the number t = 2.0**n */
+ PUT_BITS_DP64(((long long)n + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t);
+ return x*t;
+}
+#endif /* USE_SCALEDOUBLE_1 */
+
+
+#if defined(USE_SCALEDOUBLE_2)
+/* Scales the double x by 2.0**n.
+ Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline double scaleDouble_2(double x, int n)
+{
+ double t1, t2;
+ int n1, n2;
+ n1 = n / 2;
+ n2 = n - n1;
+ /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+ PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+ PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+ return (x*t1)*t2;
+}
+#endif /* USE_SCALEDOUBLE_2 */
+
+
+#if defined(USE_SCALEDOUBLE_3)
+/* Scales the double x by 2.0**n.
+ Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline double scaleDouble_3(double x, int n)
+{
+ double t1, t2, t3;
+ int n1, n2, n3;
+ n1 = n / 3;
+ n2 = (n - n1) / 2;
+ n3 = n - n1 - n2;
+ /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+ PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+ PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+ PUT_BITS_DP64(((long long)n3 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t3);
+ return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEDOUBLE_3 */
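+
+/* Editorial note with a worked example (not part of the original source):
+   scaleDouble_1 builds t = 2.0**n directly from its bit pattern, so
+   scaleDouble_1(1.5, 4) returns 24.0 (t = 2**4 = 16).  scaleDouble_2 and
+   scaleDouble_3 split n into two or three parts so that each factor 2**ni
+   stays inside the normal exponent range even when n itself does not, at
+   the cost of one or two extra multiplications. */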
+
+
+#if defined(USE_SCALEFLOAT_1)
+/* Scales the float x by 2.0**n.
+ Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline float scaleFloat_1(float x, int n)
+{
+ float t;
+ /* Construct the number t = 2.0**n */
+ PUT_BITS_SP32((n + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t);
+ return x*t;
+}
+#endif /* USE_SCALEFLOAT_1 */
+
+
+#if defined(USE_SCALEFLOAT_2)
+/* Scales the float x by 2.0**n.
+ Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline float scaleFloat_2(float x, int n)
+{
+ float t1, t2;
+ int n1, n2;
+ n1 = n / 2;
+ n2 = n - n1;
+ /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+ PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+ PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+ return (x*t1)*t2;
+}
+#endif /* USE_SCALEFLOAT_2 */
+
+
+#if defined(USE_SCALEFLOAT_3)
+/* Scales the float x by 2.0**n.
+ Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline float scaleFloat_3(float x, int n)
+{
+ float t1, t2, t3;
+ int n1, n2, n3;
+ n1 = n / 3;
+ n2 = (n - n1) / 2;
+ n3 = n - n1 - n2;
+ /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+ PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+ PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+ PUT_BITS_SP32((n3 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t3);
+ return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEFLOAT_3 */
+
+#if defined(USE_SETPRECISIONDOUBLE)
+unsigned int setPrecisionDouble(void)
+{
+ unsigned int cw, cwold = 0;
+ /* There is no precision control on Hammer */
+ return cwold;
+}
+#endif /* USE_SETPRECISIONDOUBLE */
+
+#if defined(USE_RESTOREPRECISION)
+void restorePrecision(unsigned int cwold)
+{
+#if defined(WINDOWS)
+ /* There is no precision control on Hammer */
+#elif defined(linux)
+ /* There is no precision control on Hammer */
+#else
+#error Unknown machine
+#endif
+ return;
+}
+#endif /* USE_RESTOREPRECISION */
+
+
+#if defined(USE_CLEAR_FPSW_FLAGS)
+/* Clears floating-point status flags. The argument should be
+   the bitwise OR of the AMD_F_* flags to be cleared, e.g.
+     clear_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline void clear_fpsw_flags(int flags)
+{
+#if defined(WINDOWS)
+ unsigned int cw = _mm_getcsr();
+ cw &= (~flags);
+ _mm_setcsr(cw);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw &= (~flags);
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_CLEAR_FPSW_FLAGS */
+
+
+#if defined(USE_RAISE_FPSW_FLAGS)
+/* Raises floating-point status flags. The argument should be
+   the bitwise OR of the AMD_F_* flags to be raised, e.g.
+     raise_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline void raise_fpsw_flags(int flags)
+{
+#if defined(WINDOWS)
+ _mm_setcsr(_mm_getcsr() | flags);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw |= flags;
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_RAISE_FPSW_FLAGS */
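+
+/* Editorial note (not part of the original source): on both paths above the
+   flags are simply OR-ed into MXCSR, the SSE control/status register, so
+   for example
+
+     raise_fpsw_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+
+   marks the overflow and inexact exceptions as having occurred without
+   changing the rounding mode or the exception masks. */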
+
+
+#if defined(USE_GET_FPSW_INLINE)
+/* Return the current floating-point status word */
+static inline unsigned int get_fpsw_inline(void)
+{
+#if defined(WINDOWS)
+ return _mm_getcsr();
+#elif defined(linux)
+ unsigned int sw;
+ asm volatile ("STMXCSR %0" : "=m" (sw));
+ return sw;
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_GET_FPSW_INLINE */
+
+#if defined(USE_SET_FPSW_INLINE)
+/* Set the floating-point status word */
+static inline void set_fpsw_inline(unsigned int sw)
+{
+#if defined(WINDOWS)
+ _mm_setcsr(sw);
+#elif defined(linux)
+ /* Set the current floating-point control/status word */
+ asm volatile ("LDMXCSR %0" : : "m" (sw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_SET_FPSW_INLINE */
+
+#if defined(USE_CLEAR_FPSW_INLINE)
+/* Clear all exceptions from the floating-point status word */
+static inline void clear_fpsw_inline(void)
+{
+#if defined(WINDOWS)
+ unsigned int cw;
+ cw = _mm_getcsr();
+ cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW |
+ AMD_F_DIVBYZERO | AMD_F_INVALID);
+ _mm_setcsr(cw);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW |
+ AMD_F_DIVBYZERO | AMD_F_INVALID);
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_CLEAR_FPSW_INLINE */
+
+
+#if defined(USE_VAL_WITH_FLAGS)
+/* Returns a double value after raising the given flags,
+ e.g. val_with_flags(x, AMD_F_INEXACT);
+ */
+static inline double val_with_flags(double val, int flags)
+{
+ raise_fpsw_flags(flags);
+ return val;
+}
+#endif /* USE_VAL_WITH_FLAGS */
+
+#if defined(USE_VALF_WITH_FLAGS)
+/* Returns a float value after raising the given flags,
+ e.g. valf_with_flags(x, AMD_F_INEXACT);
+ */
+static inline float valf_with_flags(float val, int flags)
+{
+ raise_fpsw_flags(flags);
+ return val;
+}
+#endif /* USE_VALF_WITH_FLAGS */
+
+
+#if defined(USE_ZERO_WITH_FLAGS)
+/* Returns a double +zero after raising the given flags,
+ e.g. zero_with_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline double zero_with_flags(int flags)
+{
+ raise_fpsw_flags(flags);
+ return 0.0;
+}
+#endif /* USE_ZERO_WITH_FLAGS */
+
+
+#if defined(USE_ZEROF_WITH_FLAGS)
+/* Returns a float +zero after raising the given flags,
+ e.g. zerof_with_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline float zerof_with_flags(int flags)
+{
+ raise_fpsw_flags(flags);
+ return 0.0F;
+}
+#endif /* USE_ZEROF_WITH_FLAGS */
+
+
+#if defined(USE_NAN_WITH_FLAGS)
+/* Returns a double quiet +nan after raising the given flags,
+ e.g. nan_with_flags(AMD_F_INVALID);
+*/
+static inline double nan_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64(0x7ff8000000000000, z);
+ return z;
+}
+#endif /* USE_NAN_WITH_FLAGS */
+
+#if defined(USE_NANF_WITH_FLAGS)
+/* Returns a float quiet +nan after raising the given flags,
+ e.g. nanf_with_flags(AMD_F_INVALID);
+*/
+static inline float nanf_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32(0x7fc00000, z);
+ return z;
+}
+#endif /* USE_NANF_WITH_FLAGS */
+
+
+#if defined(USE_INDEFINITE_WITH_FLAGS)
+/* Returns a double indefinite after raising the given flags,
+ e.g. indefinite_with_flags(AMD_F_INVALID);
+*/
+static inline double indefinite_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64(0xfff8000000000000, z);
+ return z;
+}
+#endif /* USE_INDEFINITE_WITH_FLAGS */
+
+#if defined(USE_INDEFINITEF_WITH_FLAGS)
+/* Returns a float quiet +indefinite after raising the given flags,
+ e.g. indefinitef_with_flags(AMD_F_INVALID);
+*/
+static inline float indefinitef_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32(0xffc00000, z);
+ return z;
+}
+#endif /* USE_INDEFINITEF_WITH_FLAGS */
+
+
+#ifdef USE_INFINITY_WITH_FLAGS
+/* Returns a positive double infinity after raising the given flags,
+ e.g. infinity_with_flags(AMD_F_OVERFLOW);
+*/
+static inline double infinity_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64((unsigned long long)(BIASEDEMAX_DP64 + 1) << EXPSHIFTBITS_DP64, z);
+ return z;
+}
+#endif /* USE_INFINITY_WITH_FLAGS */
+
+#ifdef USE_INFINITYF_WITH_FLAGS
+/* Returns a positive float infinity after raising the given flags,
+ e.g. infinityf_with_flags(AMD_F_OVERFLOW);
+*/
+static inline float infinityf_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32((BIASEDEMAX_SP32 + 1) << EXPSHIFTBITS_SP32, z);
+ return z;
+}
+#endif /* USE_INFINITYF_WITH_FLAGS */
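+
+/* Editorial note (not part of the original source): these helpers are how
+   the inlined routines below signal special results; for instance
+   sqrt_amd_inline returns nan_with_flags(AMD_F_INVALID) for negative
+   arguments, i.e. the default quiet NaN (0x7ff8000000000000) with the
+   invalid-operation flag raised. */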
+
+
+#if defined(USE_SPLITEXP)
+/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2).
+ Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments
+ abs(x) > large/(ln(base)) (where large is the largest representable
+ floating point number) should be handled separately instead of calling
+ this function. This function is called by exp_amd, exp2_amd, exp10_amd,
+ cosh_amd and sinh_amd. */
+static inline void splitexp(double x, double logbase,
+ double thirtytwo_by_logbaseof2,
+ double logbaseof2_by_32_lead,
+ double logbaseof2_by_32_trail,
+ int *m, double *z1, double *z2)
+{
+ double q, r, r1, r2, f1, f2;
+ int n, j;
+
+/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain
+ leading and trailing parts respectively of precomputed
+ values of pow(2.0,j/32.0), for j = 0, 1, ..., 31.
+ two_to_jby32_lead_table contains the first 25 bits of precision,
+ and two_to_jby32_trail_table contains a further 53 bits precision. */
+
+ static const double two_to_jby32_lead_table[32] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.02189713716506958008e+00, /* 0x3ff059b0d0000000 */
+ 1.04427373409271240234e+00, /* 0x3ff0b55860000000 */
+ 1.06714040040969848633e+00, /* 0x3ff11301d0000000 */
+ 1.09050768613815307617e+00, /* 0x3ff172b830000000 */
+ 1.11438673734664916992e+00, /* 0x3ff1d48730000000 */
+ 1.13878858089447021484e+00, /* 0x3ff2387a60000000 */
+ 1.16372483968734741211e+00, /* 0x3ff29e9df0000000 */
+ 1.18920707702636718750e+00, /* 0x3ff306fe00000000 */
+ 1.21524733304977416992e+00, /* 0x3ff371a730000000 */
+ 1.24185776710510253906e+00, /* 0x3ff3dea640000000 */
+ 1.26905095577239990234e+00, /* 0x3ff44e0860000000 */
+ 1.29683953523635864258e+00, /* 0x3ff4bfdad0000000 */
+ 1.32523661851882934570e+00, /* 0x3ff5342b50000000 */
+ 1.35425549745559692383e+00, /* 0x3ff5ab07d0000000 */
+ 1.38390988111495971680e+00, /* 0x3ff6247eb0000000 */
+ 1.41421353816986083984e+00, /* 0x3ff6a09e60000000 */
+ 1.44518077373504638672e+00, /* 0x3ff71f75e0000000 */
+ 1.47682613134384155273e+00, /* 0x3ff7a11470000000 */
+ 1.50916439294815063477e+00, /* 0x3ff8258990000000 */
+ 1.54221081733703613281e+00, /* 0x3ff8ace540000000 */
+ 1.57598084211349487305e+00, /* 0x3ff93737b0000000 */
+ 1.61049032211303710938e+00, /* 0x3ff9c49180000000 */
+ 1.64575546979904174805e+00, /* 0x3ffa5503b0000000 */
+ 1.68179279565811157227e+00, /* 0x3ffae89f90000000 */
+ 1.71861928701400756836e+00, /* 0x3ffb7f76f0000000 */
+ 1.75625211000442504883e+00, /* 0x3ffc199bd0000000 */
+ 1.79470902681350708008e+00, /* 0x3ffcb720d0000000 */
+ 1.83400803804397583008e+00, /* 0x3ffd5818d0000000 */
+ 1.87416762113571166992e+00, /* 0x3ffdfc9730000000 */
+ 1.91520655155181884766e+00, /* 0x3ffea4afa0000000 */
+ 1.95714408159255981445e+00}; /* 0x3fff507650000000 */
+
+ static const double two_to_jby32_trail_table[32] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.14890470981563546737e-08, /* 0x3e48ac2ba1d73e2a */
+ 4.83347014379782142328e-08, /* 0x3e69f3121ec53172 */
+ 2.67125131841396124714e-10, /* 0x3df25b50a4ebbf1b */
+ 4.65271045830351350190e-08, /* 0x3e68faa2f5b9bef9 */
+ 5.24924336638693782574e-09, /* 0x3e368b9aa7805b80 */
+ 5.38622214388600821910e-08, /* 0x3e6ceac470cd83f6 */
+ 1.90902301017041969782e-08, /* 0x3e547f7b84b09745 */
+ 3.79763538792174980894e-08, /* 0x3e64636e2a5bd1ab */
+ 2.69306947081946450986e-08, /* 0x3e5ceaa72a9c5154 */
+ 4.49683815095311756138e-08, /* 0x3e682468446b6824 */
+ 1.41933332021066904914e-09, /* 0x3e18624b40c4dbd0 */
+ 1.94146510233556266402e-08, /* 0x3e54d8a89c750e5e */
+ 2.46409119489264118569e-08, /* 0x3e5a753e077c2a0f */
+ 4.94812958044698886494e-08, /* 0x3e6a90a852b19260 */
+ 8.48872238075784476136e-10, /* 0x3e0d2ac258f87d03 */
+ 2.42032342089579394887e-08, /* 0x3e59fcef32422cbf */
+ 3.32420002333182569170e-08, /* 0x3e61d8bee7ba46e2 */
+ 1.45956577586525322754e-08, /* 0x3e4f580c36bea881 */
+ 3.46452721050003920866e-08, /* 0x3e62999c25159f11 */
+ 8.07090469079979051284e-09, /* 0x3e415506dadd3e2a */
+ 2.99439161340839520436e-09, /* 0x3e29b8bc9e8a0388 */
+ 9.83621719880452147153e-09, /* 0x3e451f8480e3e236 */
+ 8.35492309647188080486e-09, /* 0x3e41f12ae45a1224 */
+ 3.48493175137966283582e-08, /* 0x3e62b5a75abd0e6a */
+ 1.11084703472699692902e-08, /* 0x3e47daf237553d84 */
+ 5.03688744342840346564e-08, /* 0x3e6b0aa538444196 */
+ 4.81896001063495806249e-08, /* 0x3e69df20d22a0798 */
+ 4.83653666334089557746e-08, /* 0x3e69f7490e4bb40b */
+ 1.29745882314081237628e-08, /* 0x3e4bdcdaf5cb4656 */
+ 9.84532844621636118964e-09, /* 0x3e452486cc2c7b9d */
+ 4.25828404545651943883e-08}; /* 0x3e66dc8a80ce9f09 */
+
+ /*
+ Step 1. Reduce the argument.
+
+ To perform argument reduction, we find the integer n such that
+ x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64.
+ n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and
+ remainder by x - n*logbaseof2/32. The calculation of n is
+ straightforward whereas the computation of x - n*logbaseof2/32
+ must be carried out carefully.
+     logbaseof2/32 is represented in two pieces so that
+ (1) logbaseof2/32 is known to extra precision, (2) the product
+ of n and the leading piece is a model number and is hence
+ calculated without error, and (3) the subtraction of the value
+ obtained in (2) from x is a model number and is hence again
+ obtained without error.
+ */
+
+ r = x * thirtytwo_by_logbaseof2;
+ /* Set n = nearest integer to r */
+ /* This is faster on Hammer */
+ if (r > 0)
+ n = (int)(r + 0.5);
+ else
+ n = (int)(r - 0.5);
+
+ r1 = x - n * logbaseof2_by_32_lead;
+ r2 = - n * logbaseof2_by_32_trail;
+
+ /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */
+ /* j = n % 32;
+ if (j < 0) j += 32; */
+ j = n & 0x0000001f;
+
+ f1 = two_to_jby32_lead_table[j];
+ f2 = two_to_jby32_trail_table[j];
+
+ *m = (n - j) / 32;
+
+ /* Step 2. The following is the core approximation. We approximate
+ exp(r1+r2)-1 by a polynomial. */
+
+ r1 *= logbase; r2 *= logbase;
+
+ r = r1 + r2;
+ q = r1 + (r2 +
+ r*r*( 5.00000000000000008883e-01 +
+ r*( 1.66666666665260878863e-01 +
+ r*( 4.16666666662260795726e-02 +
+ r*( 8.33336798434219616221e-03 +
+ r*( 1.38889490863777199667e-03 ))))));
+
+ /* Step 3. Function value reconstruction.
+ We now reconstruct the exponential of the input argument
+ so that exp(x) = 2**m * (z1 + z2).
+ The order of the computation below must be strictly observed. */
+
+ *z1 = f1;
+ *z2 = f2 + ((f1 + f2) * q);
+}
+#endif /* USE_SPLITEXP */
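+
+/* Editorial sketch (not part of the original source): for base 2 the
+   reduction constants are exact, so a caller such as exp2 could, in
+   principle, invoke the kernel as
+
+     int m; double z1, z2;
+     splitexp(x,
+              6.93147180559945286e-01,   -- ln(2)
+              32.0,                      -- 32 / log2(2)
+              3.125e-02, 0.0,            -- log2(2)/32 = 1/32, no trailing part
+              &m, &z1, &z2);
+
+   after which exp2(x) = 2**m * (z1 + z2), e.g. via scaleDouble_1.
+   The constants the library's exp2 implementation really uses may be
+   split differently; the values above are only for illustration. */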
+
+
+#if defined(USE_SPLITEXPF)
+/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2).
+ Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments
+ abs(x) > large/(ln(base)) (where large is the largest representable
+ floating point number) should be handled separately instead of calling
+ this function. This function is called by exp_amd, exp2_amd, exp10_amd,
+ cosh_amd and sinh_amd. */
+static inline void splitexpf(float x, float logbase,
+ float thirtytwo_by_logbaseof2,
+ float logbaseof2_by_32_lead,
+ float logbaseof2_by_32_trail,
+ int *m, float *z1, float *z2)
+{
+ float q, r, r1, r2, f1, f2;
+ int n, j;
+
+/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain
+ leading and trailing parts respectively of precomputed
+ values of pow(2.0,j/32.0), for j = 0, 1, ..., 31.
+ two_to_jby32_lead_table contains the first 10 bits of precision,
+ and two_to_jby32_trail_table contains a further 24 bits precision. */
+
+ static const float two_to_jby32_lead_table[32] = {
+ 1.0000000000E+00F, /* 0x3F800000 */
+ 1.0214843750E+00F, /* 0x3F82C000 */
+ 1.0429687500E+00F, /* 0x3F858000 */
+ 1.0664062500E+00F, /* 0x3F888000 */
+ 1.0898437500E+00F, /* 0x3F8B8000 */
+ 1.1132812500E+00F, /* 0x3F8E8000 */
+ 1.1386718750E+00F, /* 0x3F91C000 */
+ 1.1621093750E+00F, /* 0x3F94C000 */
+ 1.1875000000E+00F, /* 0x3F980000 */
+ 1.2148437500E+00F, /* 0x3F9B8000 */
+ 1.2402343750E+00F, /* 0x3F9EC000 */
+ 1.2675781250E+00F, /* 0x3FA24000 */
+ 1.2949218750E+00F, /* 0x3FA5C000 */
+ 1.3242187500E+00F, /* 0x3FA98000 */
+ 1.3535156250E+00F, /* 0x3FAD4000 */
+ 1.3828125000E+00F, /* 0x3FB10000 */
+ 1.4140625000E+00F, /* 0x3FB50000 */
+ 1.4433593750E+00F, /* 0x3FB8C000 */
+ 1.4765625000E+00F, /* 0x3FBD0000 */
+ 1.5078125000E+00F, /* 0x3FC10000 */
+ 1.5410156250E+00F, /* 0x3FC54000 */
+ 1.5742187500E+00F, /* 0x3FC98000 */
+ 1.6093750000E+00F, /* 0x3FCE0000 */
+ 1.6445312500E+00F, /* 0x3FD28000 */
+ 1.6816406250E+00F, /* 0x3FD74000 */
+ 1.7167968750E+00F, /* 0x3FDBC000 */
+ 1.7558593750E+00F, /* 0x3FE0C000 */
+ 1.7929687500E+00F, /* 0x3FE58000 */
+ 1.8339843750E+00F, /* 0x3FEAC000 */
+ 1.8730468750E+00F, /* 0x3FEFC000 */
+ 1.9140625000E+00F, /* 0x3FF50000 */
+ 1.9570312500E+00F}; /* 0x3FFA8000 */
+
+ static const float two_to_jby32_trail_table[32] = {
+ 0.0000000000E+00F, /* 0x00000000 */
+ 4.1277357377E-04F, /* 0x39D86988 */
+ 1.3050324051E-03F, /* 0x3AAB0D9F */
+ 7.3415064253E-04F, /* 0x3A407404 */
+ 6.6398258787E-04F, /* 0x3A2E0F1E */
+ 1.1054925853E-03F, /* 0x3A90E62D */
+ 1.1675967835E-04F, /* 0x38F4DCE0 */
+ 1.6154836630E-03F, /* 0x3AD3BEA3 */
+ 1.7071149778E-03F, /* 0x3ADFC146 */
+ 4.0360994171E-04F, /* 0x39D39B9C */
+ 1.6234370414E-03F, /* 0x3AD4C982 */
+ 1.4728321694E-03F, /* 0x3AC10C0C */
+ 1.9176795613E-03F, /* 0x3AFB5AA6 */
+ 1.0178930825E-03F, /* 0x3A856AD3 */
+ 7.3992193211E-04F, /* 0x3A41F752 */
+ 1.0973819299E-03F, /* 0x3A8FD607 */
+ 1.5106226783E-04F, /* 0x391E6678 */
+ 1.8214319134E-03F, /* 0x3AEEBD1D */
+ 2.6364589576E-04F, /* 0x398A39F4 */
+ 1.3519275235E-03F, /* 0x3AB13329 */
+ 1.1952003697E-03F, /* 0x3A9CA845 */
+ 1.7620950239E-03F, /* 0x3AE6F619 */
+ 1.1153318919E-03F, /* 0x3A923054 */
+ 1.2242280645E-03F, /* 0x3AA07647 */
+ 1.5220546629E-04F, /* 0x391F9958 */
+ 1.8224230735E-03F, /* 0x3AEEDE5F */
+ 3.9278529584E-04F, /* 0x39CDEEC0 */
+ 1.7403248930E-03F, /* 0x3AE41B9D */
+ 2.3711356334E-05F, /* 0x37C6E7C0 */
+ 1.1207590578E-03F, /* 0x3A92E66F */
+ 1.1440613307E-03F, /* 0x3A95F454 */
+ 1.1287408415E-04F}; /* 0x38ECB6D0 */
+
+ /*
+ Step 1. Reduce the argument.
+
+ To perform argument reduction, we find the integer n such that
+ x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64.
+ n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and
+ remainder by x - n*logbaseof2/32. The calculation of n is
+ straightforward whereas the computation of x - n*logbaseof2/32
+ must be carried out carefully.
+     logbaseof2/32 is represented in two pieces so that
+ (1) logbaseof2/32 is known to extra precision, (2) the product
+ of n and the leading piece is a model number and is hence
+ calculated without error, and (3) the subtraction of the value
+ obtained in (2) from x is a model number and is hence again
+ obtained without error.
+ */
+
+ r = x * thirtytwo_by_logbaseof2;
+ /* Set n = nearest integer to r */
+ /* This is faster on Hammer */
+ if (r > 0)
+ n = (int)(r + 0.5F);
+ else
+ n = (int)(r - 0.5F);
+
+ r1 = x - n * logbaseof2_by_32_lead;
+ r2 = - n * logbaseof2_by_32_trail;
+
+ /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */
+ /* j = n % 32;
+ if (j < 0) j += 32; */
+ j = n & 0x0000001f;
+
+ f1 = two_to_jby32_lead_table[j];
+ f2 = two_to_jby32_trail_table[j];
+
+ *m = (n - j) / 32;
+
+ /* Step 2. The following is the core approximation. We approximate
+ exp(r1+r2)-1 by a polynomial. */
+
+ r1 *= logbase; r2 *= logbase;
+
+ r = r1 + r2;
+ q = r1 + (r2 +
+ r*r*( 5.00000000000000008883e-01F +
+ r*( 1.66666666665260878863e-01F )));
+
+ /* Step 3. Function value reconstruction.
+ We now reconstruct the exponential of the input argument
+ so that exp(x) = 2**m * (z1 + z2).
+ The order of the computation below must be strictly observed. */
+
+ *z1 = f1;
+ *z2 = f2 + ((f1 + f2) * q);
+}
+#endif /* SPLITEXPF */
+
+
+#if defined(USE_SCALEUPDOUBLE1024)
+/* Scales up a double (normal or denormal) whose bit pattern is given
+ as ux by 2**1024. There are no checks that the input number is
+ scalable by that amount. */
+static inline void scaleUpDouble1024(unsigned long long ux, unsigned long long *ur)
+{
+ unsigned long long uy;
+ double y;
+
+ if ((ux & EXPBITS_DP64) == 0)
+ {
+ /* ux is denormalised */
+    PUT_BITS_DP64(ux | 0x4010000000000000, y);
+    /* Compensate for the implicit bit just added */
+    if (ux & SIGNBIT_DP64)
+ y += 4.0;
+ else
+ y -= 4.0;
+ GET_BITS_DP64(y, uy);
+ }
+ else
+ /* ux is normal */
+ uy = ux + 0x4000000000000000;
+
+ *ur = uy;
+ return;
+}
+
+#endif /* SCALEUPDOUBLE1024 */
+
+
+#if defined(USE_SCALEDOWNDOUBLE)
+/* Scales down a double whose bit pattern is given as ux by 2**k.
+ There are no checks that the input number is scalable by that amount. */
+static inline void scaleDownDouble(unsigned long long ux, int k,
+ unsigned long long *ur)
+{
+ unsigned long long uy, uk, ax, xsign;
+ int n, shift;
+ xsign = ux & SIGNBIT_DP64;
+ ax = ux & ~SIGNBIT_DP64;
+ n = (int)((ax & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - k;
+ if (n > 0)
+ {
+ uk = (unsigned long long)n << EXPSHIFTBITS_DP64;
+ uy = (ax & ~EXPBITS_DP64) | uk;
+ }
+ else
+ {
+ uy = (ax & ~EXPBITS_DP64) | 0x0010000000000000;
+ shift = (1 - n);
+ if (shift > MANTLENGTH_DP64 + 1)
+ /* Sigh. Shifting works mod 64 so be careful not to shift too much */
+ uy = 0;
+ else
+ {
+ /* Make sure we round the result */
+ uy >>= shift - 1;
+ uy = (uy >> 1) + (uy & 1);
+ }
+ }
+ *ur = uy | xsign;
+}
+
+#endif /* SCALEDOWNDOUBLE */
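+
+/* Worked example (editorial note, not part of the original source):
+   for ux = 0x3ff0000000000000 (1.0) and k = 1 the biased exponent
+   1023 - 1 = 1022 is still positive, so the function just rewrites the
+   exponent field and returns 0x3fe0000000000000 (0.5). Only when the
+   result would fall below the normal range does it take the second
+   branch, shifting the mantissa right and rounding. */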
+
+
+#if defined(USE_SCALEUPFLOAT128)
+/* Scales up a float (normal or denormal) whose bit pattern is given
+ as ux by 2**128. There are no checks that the input number is
+ scalable by that amount. */
+static inline void scaleUpFloat128(unsigned int ux, unsigned int *ur)
+{
+ unsigned int uy;
+ float y;
+
+ if ((ux & EXPBITS_SP32) == 0)
+ {
+ /* ux is denormalised */
+ PUT_BITS_SP32(ux | 0x40800000, y);
+ /* Compensate for the implicit bit just added */
+ if (ux & SIGNBIT_SP32)
+ y += 4.0F;
+ else
+ y -= 4.0F;
+ GET_BITS_SP32(y, uy);
+ }
+ else
+ /* ux is normal */
+ uy = ux + 0x40000000;
+ *ur = uy;
+}
+#endif /* SCALEUPFLOAT128 */
+
+
+#if defined(USE_SCALEDOWNFLOAT)
+/* Scales down a float whose bit pattern is given as ux by 2**k.
+ There are no checks that the input number is scalable by that amount. */
+static inline void scaleDownFloat(unsigned int ux, int k,
+ unsigned int *ur)
+{
+ unsigned int uy, uk, ax, xsign;
+ int n, shift;
+
+ xsign = ux & SIGNBIT_SP32;
+ ax = ux & ~SIGNBIT_SP32;
+ n = ((ax & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - k;
+ if (n > 0)
+ {
+ uk = (unsigned int)n << EXPSHIFTBITS_SP32;
+ uy = (ax & ~EXPBITS_SP32) | uk;
+ }
+ else
+ {
+ uy = (ax & ~EXPBITS_SP32) | 0x00800000;
+ shift = (1 - n);
+ if (shift > MANTLENGTH_SP32 + 1)
+ /* Sigh. Shifting works mod 32 so be careful not to shift too much */
+ uy = 0;
+ else
+ {
+ /* Make sure we round the result */
+ uy >>= shift - 1;
+ uy = (uy >> 1) + (uy & 1);
+ }
+ }
+ *ur = uy | xsign;
+}
+#endif /* SCALEDOWNFLOAT */
+
+
+#if defined(USE_SQRT_AMD_INLINE)
+static inline double sqrt_amd_inline(double x)
+{
+ /*
+ Computes the square root of x.
+
+ The calculation is carried out in three steps.
+
+ Step 1. Reduction.
+ The input argument is scaled to the interval [1, 4) by
+ computing
+ x = 2^e * y, where y in [1,4).
+ Furthermore y is decomposed as y = c + t where
+ c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64.
+
+ Step 2. Approximation.
+ An approximation q = sqrt(1 + (t/c)) - 1 is obtained
+ from a basic series expansion using precomputed values
+ stored in rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl.
+
+ Step 3. Reconstruction.
+ The value of sqrt(x) is reconstructed via
+ sqrt(x) = 2^(e/2) * sqrt(y)
+ = 2^(e/2) * sqrt(c) * sqrt(y/c)
+ = 2^(e/2) * sqrt(c) * sqrt(1 + t/c)
+ = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ]
+ */
+
+ unsigned long long ux, ax, u;
+ double r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail;
+ int e, denorm = 0, index;
+
+/* Arrays rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl contain
+ leading and trailing parts respectively of precomputed
+ values of sqrt(j/32), for j = 32, 33, ..., 128.
+ rt_jby32_lead_table_dbl contains the first 21 bits of precision,
+ and rt_jby32_trail_table_dbl contains a further 53 bits precision. */
+
+ static const double rt_jby32_lead_table_dbl[97] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.01550388336181640625e+00, /* 0x3ff03f8100000000 */
+ 1.03077602386474609375e+00, /* 0x3ff07e0f00000000 */
+ 1.04582500457763671875e+00, /* 0x3ff0bbb300000000 */
+ 1.06065940856933593750e+00, /* 0x3ff0f87600000000 */
+ 1.07528972625732421875e+00, /* 0x3ff1346300000000 */
+ 1.08972454071044921875e+00, /* 0x3ff16f8300000000 */
+ 1.10396957397460937500e+00, /* 0x3ff1a9dc00000000 */
+ 1.11803340911865234375e+00, /* 0x3ff1e37700000000 */
+ 1.13192272186279296875e+00, /* 0x3ff21c5b00000000 */
+ 1.14564323425292968750e+00, /* 0x3ff2548e00000000 */
+ 1.15920162200927734375e+00, /* 0x3ff28c1700000000 */
+ 1.17260360717773437500e+00, /* 0x3ff2c2fc00000000 */
+ 1.18585395812988281250e+00, /* 0x3ff2f94200000000 */
+ 1.19895744323730468750e+00, /* 0x3ff32eee00000000 */
+ 1.21191978454589843750e+00, /* 0x3ff3640600000000 */
+ 1.22474479675292968750e+00, /* 0x3ff3988e00000000 */
+ 1.23743629455566406250e+00, /* 0x3ff3cc8a00000000 */
+ 1.25000000000000000000e+00, /* 0x3ff4000000000000 */
+ 1.26243782043457031250e+00, /* 0x3ff432f200000000 */
+ 1.27475452423095703125e+00, /* 0x3ff4656500000000 */
+ 1.28695297241210937500e+00, /* 0x3ff4975c00000000 */
+ 1.29903793334960937500e+00, /* 0x3ff4c8dc00000000 */
+ 1.31101036071777343750e+00, /* 0x3ff4f9e600000000 */
+ 1.32287502288818359375e+00, /* 0x3ff52a7f00000000 */
+ 1.33463478088378906250e+00, /* 0x3ff55aaa00000000 */
+ 1.34629058837890625000e+00, /* 0x3ff58a6800000000 */
+ 1.35784721374511718750e+00, /* 0x3ff5b9be00000000 */
+ 1.36930561065673828125e+00, /* 0x3ff5e8ad00000000 */
+ 1.38066959381103515625e+00, /* 0x3ff6173900000000 */
+ 1.39194107055664062500e+00, /* 0x3ff6456400000000 */
+ 1.40312099456787109375e+00, /* 0x3ff6732f00000000 */
+ 1.41421318054199218750e+00, /* 0x3ff6a09e00000000 */
+ 1.42521858215332031250e+00, /* 0x3ff6cdb200000000 */
+ 1.43614006042480468750e+00, /* 0x3ff6fa6e00000000 */
+ 1.44697952270507812500e+00, /* 0x3ff726d400000000 */
+ 1.45773792266845703125e+00, /* 0x3ff752e500000000 */
+ 1.46841716766357421875e+00, /* 0x3ff77ea300000000 */
+ 1.47901916503906250000e+00, /* 0x3ff7aa1000000000 */
+ 1.48954677581787109375e+00, /* 0x3ff7d52f00000000 */
+ 1.50000000000000000000e+00, /* 0x3ff8000000000000 */
+ 1.51038074493408203125e+00, /* 0x3ff82a8500000000 */
+ 1.52068996429443359375e+00, /* 0x3ff854bf00000000 */
+ 1.53093051910400390625e+00, /* 0x3ff87eb100000000 */
+ 1.54110336303710937500e+00, /* 0x3ff8a85c00000000 */
+ 1.55120849609375000000e+00, /* 0x3ff8d1c000000000 */
+ 1.56124877929687500000e+00, /* 0x3ff8fae000000000 */
+ 1.57122516632080078125e+00, /* 0x3ff923bd00000000 */
+ 1.58113861083984375000e+00, /* 0x3ff94c5800000000 */
+ 1.59099006652832031250e+00, /* 0x3ff974b200000000 */
+ 1.60078048706054687500e+00, /* 0x3ff99ccc00000000 */
+ 1.61051177978515625000e+00, /* 0x3ff9c4a800000000 */
+ 1.62018489837646484375e+00, /* 0x3ff9ec4700000000 */
+ 1.62979984283447265625e+00, /* 0x3ffa13a900000000 */
+ 1.63935947418212890625e+00, /* 0x3ffa3ad100000000 */
+ 1.64886283874511718750e+00, /* 0x3ffa61be00000000 */
+ 1.65831184387207031250e+00, /* 0x3ffa887200000000 */
+ 1.66770744323730468750e+00, /* 0x3ffaaeee00000000 */
+ 1.67705059051513671875e+00, /* 0x3ffad53300000000 */
+ 1.68634128570556640625e+00, /* 0x3ffafb4100000000 */
+ 1.69558238983154296875e+00, /* 0x3ffb211b00000000 */
+ 1.70477199554443359375e+00, /* 0x3ffb46bf00000000 */
+ 1.71391296386718750000e+00, /* 0x3ffb6c3000000000 */
+ 1.72300529479980468750e+00, /* 0x3ffb916e00000000 */
+ 1.73204994201660156250e+00, /* 0x3ffbb67a00000000 */
+ 1.74104785919189453125e+00, /* 0x3ffbdb5500000000 */
+ 1.75000000000000000000e+00, /* 0x3ffc000000000000 */
+ 1.75890541076660156250e+00, /* 0x3ffc247a00000000 */
+ 1.76776695251464843750e+00, /* 0x3ffc48c600000000 */
+ 1.77658367156982421875e+00, /* 0x3ffc6ce300000000 */
+ 1.78535652160644531250e+00, /* 0x3ffc90d200000000 */
+ 1.79408740997314453125e+00, /* 0x3ffcb49500000000 */
+ 1.80277538299560546875e+00, /* 0x3ffcd82b00000000 */
+ 1.81142139434814453125e+00, /* 0x3ffcfb9500000000 */
+ 1.82002735137939453125e+00, /* 0x3ffd1ed500000000 */
+ 1.82859230041503906250e+00, /* 0x3ffd41ea00000000 */
+ 1.83711719512939453125e+00, /* 0x3ffd64d500000000 */
+ 1.84560203552246093750e+00, /* 0x3ffd879600000000 */
+ 1.85404872894287109375e+00, /* 0x3ffdaa2f00000000 */
+ 1.86245727539062500000e+00, /* 0x3ffdcca000000000 */
+ 1.87082862854003906250e+00, /* 0x3ffdeeea00000000 */
+ 1.87916183471679687500e+00, /* 0x3ffe110c00000000 */
+ 1.88745784759521484375e+00, /* 0x3ffe330700000000 */
+ 1.89571857452392578125e+00, /* 0x3ffe54dd00000000 */
+ 1.90394306182861328125e+00, /* 0x3ffe768d00000000 */
+ 1.91213226318359375000e+00, /* 0x3ffe981800000000 */
+ 1.92028617858886718750e+00, /* 0x3ffeb97e00000000 */
+ 1.92840576171875000000e+00, /* 0x3ffedac000000000 */
+ 1.93649101257324218750e+00, /* 0x3ffefbde00000000 */
+ 1.94454288482666015625e+00, /* 0x3fff1cd900000000 */
+ 1.95256233215332031250e+00, /* 0x3fff3db200000000 */
+ 1.96054744720458984375e+00, /* 0x3fff5e6700000000 */
+ 1.96850109100341796875e+00, /* 0x3fff7efb00000000 */
+ 1.97642326354980468750e+00, /* 0x3fff9f6e00000000 */
+ 1.98431301116943359375e+00, /* 0x3fffbfbf00000000 */
+ 1.99217128753662109375e+00, /* 0x3fffdfef00000000 */
+ 2.00000000000000000000e+00}; /* 0x4000000000000000 */
+
+ static const double rt_jby32_trail_table_dbl[97] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 9.17217678638807524014e-07, /* 0x3eaec6d70177881c */
+ 3.82539669043705364790e-07, /* 0x3e99abfb41bd6b24 */
+ 2.85899577162227138140e-08, /* 0x3e5eb2bf6bab55a2 */
+ 7.63210485349101216659e-07, /* 0x3ea99bed9b2d8d0c */
+ 9.32123004127716212874e-07, /* 0x3eaf46e029c1b296 */
+ 1.95174719169309219157e-07, /* 0x3e8a3226fc42f30c */
+ 5.34316371481845492427e-07, /* 0x3ea1edbe20701d73 */
+ 5.79631242504454563052e-07, /* 0x3ea372fe94f82be7 */
+ 4.20404384109571705948e-07, /* 0x3e9c367e08e7bb06 */
+ 6.89486030314147010716e-07, /* 0x3ea722a3d0a66608 */
+ 6.89927685625314560328e-07, /* 0x3ea7266f067ca1d6 */
+ 3.32778123013641425828e-07, /* 0x3e965515a9b34850 */
+ 1.64433259436999584387e-07, /* 0x3e8611e23ef6c1bd */
+ 4.37590875197899335723e-07, /* 0x3e9d5dc1059ed8e7 */
+ 1.79808183816018617413e-07, /* 0x3e88222982d0e4f4 */
+ 7.46386593615986477624e-08, /* 0x3e7409212e7d0322 */
+ 5.72520794105201454728e-07, /* 0x3ea335ea8a5fcf39 */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 2.96860689431670420344e-07, /* 0x3e93ec071e938bfe */
+ 3.54167239176257065345e-07, /* 0x3e97c48bfd9862c6 */
+ 7.95211265664474710063e-07, /* 0x3eaaaed010f74671 */
+ 1.72327048595145565621e-07, /* 0x3e87211cbfeb62e0 */
+ 6.99494915996239297020e-07, /* 0x3ea7789d9660e72d */
+ 6.32644111701500844315e-07, /* 0x3ea53a5f1d36f1cf */
+ 6.20124838851440463844e-10, /* 0x3e054eacff2057dc */
+ 6.13404719757812629969e-07, /* 0x3ea4951b3e6a83cc */
+ 3.47654909777986407387e-07, /* 0x3e9754aa76884c66 */
+ 7.83106177002392475763e-07, /* 0x3eaa46d4b1de1074 */
+ 5.33337372440526357008e-07, /* 0x3ea1e55548f92635 */
+ 2.01508648555298681765e-08, /* 0x3e55a3070dd17788 */
+ 5.25472356925843939587e-07, /* 0x3ea1a1c5eedb0801 */
+ 3.81831102861301692797e-07, /* 0x3e999fcef32422cc */
+ 6.99220602161420018738e-07, /* 0x3ea776425d6b0199 */
+ 6.01209702477462624811e-07, /* 0x3ea42c5a1e0191a2 */
+ 9.01437000591944740554e-08, /* 0x3e7832a0bdff1327 */
+ 5.10428680864685379950e-08, /* 0x3e6b674743636676 */
+ 3.47895267104621031421e-07, /* 0x3e9758cb90d2f714 */
+ 7.80735841510641848628e-07, /* 0x3eaa3278459cde25 */
+ 1.35158752025506517690e-07, /* 0x3e822404f4a103ee */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.76523947728535489812e-09, /* 0x3e1e539af6892ac5 */
+ 6.68280121328499932183e-07, /* 0x3ea66c7b872c9cd0 */
+ 5.70135482405123276616e-07, /* 0x3ea3216d2f43887d */
+ 1.37705134737562525897e-07, /* 0x3e827b832cbedc0e */
+ 7.09655107074516613672e-07, /* 0x3ea7cfe41579091d */
+ 7.20302724551461693011e-07, /* 0x3ea82b5a713c490a */
+ 4.69926266058212796694e-07, /* 0x3e9f8945932d872e */
+ 2.19244345915999437026e-07, /* 0x3e8d6d2da9490251 */
+ 1.91141411617401877927e-07, /* 0x3e89a791a3114e4a */
+ 5.72297665296622053774e-07, /* 0x3ea333ffe005988d */
+ 5.61055484436830560103e-07, /* 0x3ea2d36e0ed49ab1 */
+ 2.76225500213991506100e-07, /* 0x3e92898498f55f9e */
+ 7.58466189522395692908e-07, /* 0x3ea9732cca1032a3 */
+ 1.56893371256836029827e-07, /* 0x3e850ed0b02a22d2 */
+ 4.06038997708867066507e-07, /* 0x3e9b3fb265b1e40a */
+ 5.51305629612057435809e-07, /* 0x3ea27fade682d1de */
+ 5.64778487026561123207e-07, /* 0x3ea2f36906f707ba */
+ 3.92609705553556897517e-07, /* 0x3e9a58fbbee883b6 */
+ 9.09698438776943827802e-07, /* 0x3eae864005bca6d7 */
+ 1.05949774066016139743e-07, /* 0x3e7c70d02300f263 */
+ 7.16578798392844784244e-07, /* 0x3ea80b5d712d8e3e */
+ 6.86233073531233972561e-07, /* 0x3ea706b27cc7d390 */
+ 7.99211473033494452908e-07, /* 0x3eaad12c9d849a97 */
+ 8.65552275731027456121e-07, /* 0x3ead0b09954e764b */
+ 6.75456120386058448618e-07, /* 0x3ea6aa1fb7826cbd */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 4.99167184520462138743e-07, /* 0x3ea0bfd03f46763c */
+ 4.51720373502110930296e-10, /* 0x3dff0abfb4adfb9e */
+ 1.28874162718371367439e-07, /* 0x3e814c151f991b2e */
+ 5.85529267186999798656e-07, /* 0x3ea3a5a879b09292 */
+ 1.01827770937125531924e-07, /* 0x3e7b558d173f9796 */
+ 2.54736389177809626508e-07, /* 0x3e9118567cd83fb8 */
+ 6.98925535290464831294e-07, /* 0x3ea773b981896751 */
+ 1.20940735036524314513e-07, /* 0x3e803b7df49f48a8 */
+ 5.43759351196479689657e-08, /* 0x3e6d315f22491900 */
+ 1.11957989042397958409e-07, /* 0x3e7e0db1c5bb84b2 */
+ 8.47006714134442661218e-07, /* 0x3eac6bbb7644ff76 */
+ 8.92831044643427836228e-07, /* 0x3eadf55c3afec01f */
+ 7.77828292464916501663e-07, /* 0x3eaa197e81034da3 */
+ 6.48469316302918797451e-08, /* 0x3e71683f4920555d */
+ 2.12579816658859849140e-07, /* 0x3e8c882fd78bb0b0 */
+ 7.61222472580559138435e-07, /* 0x3ea98ad9eb7b83ec */
+ 2.86488961857314189607e-07, /* 0x3e9339d7c7777273 */
+ 2.14637363790165363515e-07, /* 0x3e8ccee237cae6fe */
+ 5.44137005612605847831e-08, /* 0x3e6d368fe324a146 */
+ 2.58378284856442408413e-07, /* 0x3e9156e7b6d99b45 */
+ 3.15848939061134843091e-07, /* 0x3e95323e5310b5c1 */
+ 6.60530466255089632309e-07, /* 0x3ea629e9db362f5d */
+ 7.63436345535852301127e-07, /* 0x3ea99dde4728d7ec */
+ 8.68233432860324345268e-08, /* 0x3e774e746878544d */
+ 9.45465175398023087082e-07, /* 0x3eafb97be873a87d */
+ 8.77499534786171267246e-07, /* 0x3ead71a9e23c2f63 */
+ 2.74055432394999316135e-07, /* 0x3e92643c89cda173 */
+ 4.72129009349126213532e-07, /* 0x3e9faf1d57a4d56c */
+ 8.93777032327078947306e-07, /* 0x3eadfd7c7ab7b282 */
+ 0.00000000000000000000e+00}; /* 0x0000000000000000 */
+
+
+ /* Handle special arguments first */
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+
+ if(ax >= 0x7ff0000000000000)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ /* x is NaN */
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ else if (ux & SIGNBIT_DP64)
+ /* x is negative infinity */
+ return nan_with_flags(AMD_F_INVALID);
+ else
+ /* x is positive infinity */
+ return x;
+ }
+ else if (ux & SIGNBIT_DP64)
+ {
+ /* x is negative. */
+ if (ux == SIGNBIT_DP64)
+ /* Handle negative zero first */
+ return x;
+ else
+ return nan_with_flags(AMD_F_INVALID);
+ }
+ else if (ux <= 0x000fffffffffffff)
+ {
+ /* x is denormalised or zero */
+ if (ux == 0)
+ /* x is zero */
+ return x;
+ else
+ {
+ /* x is denormalised; scale it up */
+ /* Normalize x by increasing the exponent by 60
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */
+ denorm = 1;
+ GET_BITS_DP64(x, ux);
+ PUT_BITS_DP64(ux | 0x03d0000000000000, x);
+ x -= corr;
+ GET_BITS_DP64(x, ux);
+ }
+ }
+
+ /* Main algorithm */
+
+ /*
+ Find y and e such that x = 2^e * y, where y in [1,4).
+ This is done using an in-lined variant of splitDouble,
+ which also ensures that e is even.
+ */
+ y = x;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ if (ux & 1)
+ {
+ GET_BITS_DP64(y, u);
+ u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+ u |= ONEEXPBITS_DP64;
+ PUT_BITS_DP64(u, y);
+ e = ux - EXPBIAS_DP64;
+ }
+ else
+ {
+ GET_BITS_DP64(y, u);
+ u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+ u |= TWOEXPBITS_DP64;
+ PUT_BITS_DP64(u, y);
+ e = ux - EXPBIAS_DP64 - 1;
+ }
+
+
+ /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+ index = (int)(32.0*y+0.5);
+
+  /* Look up the table values and compute c and r = t/c */
+
+ rtc_lead = rt_jby32_lead_table_dbl[index-32];
+ rtc_trail = rt_jby32_trail_table_dbl[index-32];
+ c = 0.03125*index;
+ r = (y - c)/c;
+
+ /*
+ Find q = sqrt(1+r) - 1.
+ From one step of Newton on (q+1)^2 = 1+r
+ */
+
+ p = r*0.5 - r*r*(0.1250079870 - r*(0.6250522999E-01));
+ twop = p + p;
+ q = p - (p*p + (twop - r))/(twop + 2.0);
+
+ /* Reconstruction */
+
+ rtc = rtc_lead + rtc_trail;
+ e >>= 1; /* e = e/2 */
+ z = rtc_lead + (rtc*q+rtc_trail);
+
+ if (denorm)
+ {
+ /* Scale by 2**(e-30) */
+ PUT_BITS_DP64(((long long)(e - 30) + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+ z *= r;
+ }
+ else
+ {
+ /* Scale by 2**e */
+ PUT_BITS_DP64(((long long)e + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+ z *= r;
+ }
+
+ return z;
+
+}
+#endif /* SQRT_AMD_INLINE */
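+
+/* Worked example (editorial note, not part of the original source):
+   for x = 6.0 the reduction above gives e = 2 and y = 1.5 (the biased
+   exponent is odd, so y is placed in [1,2)), hence index = 48,
+   c = 48/32 = 1.5 and t = y - c = 0.  The approximation step then yields
+   q = 0, so the reconstruction returns 2**(e/2) * sqrt(c) = 2 * sqrt(1.5),
+   i.e. sqrt(6) ~= 2.449489742783178. */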
+
+#if defined(USE_SQRTF_AMD_INLINE)
+
+static inline float sqrtf_amd_inline(float x)
+{
+ /*
+ Computes the square root of x.
+
+ The calculation is carried out in three steps.
+
+ Step 1. Reduction.
+ The input argument is scaled to the interval [1, 4) by
+ computing
+ x = 2^e * y, where y in [1,4).
+ Furthermore y is decomposed as y = c + t where
+ c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64.
+
+ Step 2. Approximation.
+ An approximation q = sqrt(1 + (t/c)) - 1 is obtained
+ from a basic series expansion using precomputed values
+ stored in rt_jby32_lead_table_float and rt_jby32_trail_table_float.
+
+ Step 3. Reconstruction.
+ The value of sqrt(x) is reconstructed via
+ sqrt(x) = 2^(e/2) * sqrt(y)
+ = 2^(e/2) * sqrt(c) * sqrt(y/c)
+ = 2^(e/2) * sqrt(c) * sqrt(1 + t/c)
+ = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ]
+ */
+
+ unsigned int ux, ax, u;
+ float r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail;
+ int e, denorm = 0, index;
+
+/* Arrays rt_jby32_lead_table_float and rt_jby32_trail_table_float contain
+ leading and trailing parts respectively of precomputed
+ values of sqrt(j/32), for j = 32, 33, ..., 128.
+ rt_jby32_lead_table_float contains the first 13 bits of precision,
+ and rt_jby32_trail_table_float contains a further 24 bits precision. */
+
+static const float rt_jby32_lead_table_float[97] = {
+ 1.00000000000000000000e+00F, /* 0x3f800000 */
+ 1.01538085937500000000e+00F, /* 0x3f81f800 */
+ 1.03076171875000000000e+00F, /* 0x3f83f000 */
+ 1.04565429687500000000e+00F, /* 0x3f85d800 */
+ 1.06054687500000000000e+00F, /* 0x3f87c000 */
+ 1.07519531250000000000e+00F, /* 0x3f89a000 */
+ 1.08959960937500000000e+00F, /* 0x3f8b7800 */
+ 1.10375976562500000000e+00F, /* 0x3f8d4800 */
+ 1.11791992187500000000e+00F, /* 0x3f8f1800 */
+ 1.13183593750000000000e+00F, /* 0x3f90e000 */
+ 1.14550781250000000000e+00F, /* 0x3f92a000 */
+ 1.15917968750000000000e+00F, /* 0x3f946000 */
+ 1.17236328125000000000e+00F, /* 0x3f961000 */
+ 1.18579101562500000000e+00F, /* 0x3f97c800 */
+ 1.19873046875000000000e+00F, /* 0x3f997000 */
+ 1.21191406250000000000e+00F, /* 0x3f9b2000 */
+ 1.22460937500000000000e+00F, /* 0x3f9cc000 */
+ 1.23730468750000000000e+00F, /* 0x3f9e6000 */
+ 1.25000000000000000000e+00F, /* 0x3fa00000 */
+ 1.26220703125000000000e+00F, /* 0x3fa19000 */
+ 1.27465820312500000000e+00F, /* 0x3fa32800 */
+ 1.28686523437500000000e+00F, /* 0x3fa4b800 */
+ 1.29882812500000000000e+00F, /* 0x3fa64000 */
+ 1.31079101562500000000e+00F, /* 0x3fa7c800 */
+ 1.32275390625000000000e+00F, /* 0x3fa95000 */
+ 1.33447265625000000000e+00F, /* 0x3faad000 */
+ 1.34619140625000000000e+00F, /* 0x3fac5000 */
+ 1.35766601562500000000e+00F, /* 0x3fadc800 */
+ 1.36914062500000000000e+00F, /* 0x3faf4000 */
+ 1.38061523437500000000e+00F, /* 0x3fb0b800 */
+ 1.39184570312500000000e+00F, /* 0x3fb22800 */
+ 1.40307617187500000000e+00F, /* 0x3fb39800 */
+ 1.41406250000000000000e+00F, /* 0x3fb50000 */
+ 1.42504882812500000000e+00F, /* 0x3fb66800 */
+ 1.43603515625000000000e+00F, /* 0x3fb7d000 */
+ 1.44677734375000000000e+00F, /* 0x3fb93000 */
+ 1.45751953125000000000e+00F, /* 0x3fba9000 */
+ 1.46826171875000000000e+00F, /* 0x3fbbf000 */
+ 1.47900390625000000000e+00F, /* 0x3fbd5000 */
+ 1.48950195312500000000e+00F, /* 0x3fbea800 */
+ 1.50000000000000000000e+00F, /* 0x3fc00000 */
+ 1.51025390625000000000e+00F, /* 0x3fc15000 */
+ 1.52050781250000000000e+00F, /* 0x3fc2a000 */
+ 1.53076171875000000000e+00F, /* 0x3fc3f000 */
+ 1.54101562500000000000e+00F, /* 0x3fc54000 */
+ 1.55102539062500000000e+00F, /* 0x3fc68800 */
+ 1.56103515625000000000e+00F, /* 0x3fc7d000 */
+ 1.57104492187500000000e+00F, /* 0x3fc91800 */
+ 1.58105468750000000000e+00F, /* 0x3fca6000 */
+ 1.59082031250000000000e+00F, /* 0x3fcba000 */
+ 1.60058593750000000000e+00F, /* 0x3fcce000 */
+ 1.61035156250000000000e+00F, /* 0x3fce2000 */
+ 1.62011718750000000000e+00F, /* 0x3fcf6000 */
+ 1.62963867187500000000e+00F, /* 0x3fd09800 */
+ 1.63916015625000000000e+00F, /* 0x3fd1d000 */
+ 1.64868164062500000000e+00F, /* 0x3fd30800 */
+ 1.65820312500000000000e+00F, /* 0x3fd44000 */
+ 1.66748046875000000000e+00F, /* 0x3fd57000 */
+ 1.67700195312500000000e+00F, /* 0x3fd6a800 */
+ 1.68627929687500000000e+00F, /* 0x3fd7d800 */
+ 1.69555664062500000000e+00F, /* 0x3fd90800 */
+ 1.70458984375000000000e+00F, /* 0x3fda3000 */
+ 1.71386718750000000000e+00F, /* 0x3fdb6000 */
+ 1.72290039062500000000e+00F, /* 0x3fdc8800 */
+ 1.73193359375000000000e+00F, /* 0x3fddb000 */
+ 1.74096679687500000000e+00F, /* 0x3fded800 */
+ 1.75000000000000000000e+00F, /* 0x3fe00000 */
+ 1.75878906250000000000e+00F, /* 0x3fe12000 */
+ 1.76757812500000000000e+00F, /* 0x3fe24000 */
+ 1.77636718750000000000e+00F, /* 0x3fe36000 */
+ 1.78515625000000000000e+00F, /* 0x3fe48000 */
+ 1.79394531250000000000e+00F, /* 0x3fe5a000 */
+ 1.80273437500000000000e+00F, /* 0x3fe6c000 */
+ 1.81127929687500000000e+00F, /* 0x3fe7d800 */
+ 1.81982421875000000000e+00F, /* 0x3fe8f000 */
+ 1.82836914062500000000e+00F, /* 0x3fea0800 */
+ 1.83691406250000000000e+00F, /* 0x3feb2000 */
+ 1.84545898437500000000e+00F, /* 0x3fec3800 */
+ 1.85400390625000000000e+00F, /* 0x3fed5000 */
+ 1.86230468750000000000e+00F, /* 0x3fee6000 */
+ 1.87060546875000000000e+00F, /* 0x3fef7000 */
+ 1.87915039062500000000e+00F, /* 0x3ff08800 */
+ 1.88745117187500000000e+00F, /* 0x3ff19800 */
+ 1.89550781250000000000e+00F, /* 0x3ff2a000 */
+ 1.90380859375000000000e+00F, /* 0x3ff3b000 */
+ 1.91210937500000000000e+00F, /* 0x3ff4c000 */
+ 1.92016601562500000000e+00F, /* 0x3ff5c800 */
+ 1.92822265625000000000e+00F, /* 0x3ff6d000 */
+ 1.93627929687500000000e+00F, /* 0x3ff7d800 */
+ 1.94433593750000000000e+00F, /* 0x3ff8e000 */
+ 1.95239257812500000000e+00F, /* 0x3ff9e800 */
+ 1.96044921875000000000e+00F, /* 0x3ffaf000 */
+ 1.96826171875000000000e+00F, /* 0x3ffbf000 */
+ 1.97631835937500000000e+00F, /* 0x3ffcf800 */
+ 1.98413085937500000000e+00F, /* 0x3ffdf800 */
+ 1.99194335937500000000e+00F, /* 0x3ffef800 */
+ 2.00000000000000000000e+00F}; /* 0x40000000 */
+
+static const float rt_jby32_trail_table_float[97] = {
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.23941208585165441036e-04F, /* 0x3901f637 */
+ 1.46876545841223560274e-05F, /* 0x37766aff */
+ 1.70736297150142490864e-04F, /* 0x393307ad */
+ 1.13296780909877270460e-04F, /* 0x38ed99bf */
+ 9.53458802541717886925e-05F, /* 0x38c7f46e */
+ 1.25126505736261606216e-04F, /* 0x39033464 */
+ 2.10342666832730174065e-04F, /* 0x395c8f6e */
+ 1.14066875539720058441e-04F, /* 0x38ef3730 */
+ 8.72047676239162683487e-05F, /* 0x38b6e1b4 */
+ 1.36111237225122749805e-04F, /* 0x390eb915 */
+ 2.26244374061934649944e-05F, /* 0x37bdc99c */
+ 2.40658700931817293167e-04F, /* 0x397c5954 */
+ 6.31069415248930454254e-05F, /* 0x38845848 */
+ 2.27412077947519719601e-04F, /* 0x396e7577 */
+ 5.90185391047270968556e-06F, /* 0x36c6088a */
+ 1.35496389702893793583e-04F, /* 0x390e1409 */
+ 1.32179571664892137051e-04F, /* 0x390a99af */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 2.31086043640971183777e-04F, /* 0x39724fb0 */
+ 9.66752704698592424393e-05F, /* 0x38cabe24 */
+ 8.85332483449019491673e-05F, /* 0x38b9aaed */
+ 2.09980673389509320259e-04F, /* 0x395c2e42 */
+ 2.20044588786549866199e-04F, /* 0x3966bbc5 */
+ 1.21749282698146998882e-04F, /* 0x38ff53a6 */
+ 1.62125259521417319775e-04F, /* 0x392a002b */
+ 9.97955357888713479042e-05F, /* 0x38d14952 */
+ 1.81545779923908412457e-04F, /* 0x393e5d53 */
+ 1.65768768056295812130e-04F, /* 0x392dd237 */
+ 5.48927710042335093021e-05F, /* 0x38663caa */
+ 9.53875860432162880898e-05F, /* 0x38c80ad2 */
+ 4.53481625299900770187e-05F, /* 0x383e3438 */
+ 1.51062369695864617825e-04F, /* 0x391e667f */
+ 1.70453247847035527229e-04F, /* 0x3932bbb2 */
+ 1.05505387182347476482e-04F, /* 0x38dd42c6 */
+ 2.02269104192964732647e-04F, /* 0x39541833 */
+ 2.18442466575652360916e-04F, /* 0x39650db4 */
+ 1.55796806211583316326e-04F, /* 0x39235d63 */
+ 1.60395247803535312414e-05F, /* 0x37868c9e */
+ 4.49578510597348213196e-05F, /* 0x383c9120 */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.26840444863773882389e-04F, /* 0x39050079 */
+ 1.82820076588541269302e-04F, /* 0x393fb364 */
+ 1.69370483490638434887e-04F, /* 0x3931990b */
+ 8.78757418831810355186e-05F, /* 0x38b849ee */
+ 1.83815121999941766262e-04F, /* 0x3940be7f */
+ 2.14343352126888930798e-04F, /* 0x3960c15b */
+ 1.80714370799250900745e-04F, /* 0x393d7e25 */
+ 8.41425862745381891727e-05F, /* 0x38b075b5 */
+ 1.69945167726837098598e-04F, /* 0x3932334f */
+ 1.95121858268976211548e-04F, /* 0x394c99a0 */
+ 1.60778334247879683971e-04F, /* 0x3928969b */
+ 6.79871009197086095810e-05F, /* 0x388e944c */
+ 1.61929419846273958683e-04F, /* 0x3929cb99 */
+ 1.99474830878898501396e-04F, /* 0x39512a1e */
+ 1.81604162207804620266e-04F, /* 0x393e6cff */
+ 1.09270178654696792364e-04F, /* 0x38e527fb */
+ 2.27539261686615645885e-04F, /* 0x396e979b */
+ 4.90300008095800876617e-05F, /* 0x384da590 */
+ 6.28985289949923753738e-05F, /* 0x3883e864 */
+ 2.58551553997676819563e-05F, /* 0x37d8e386 */
+ 1.82868374395184218884e-04F, /* 0x393fc05b */
+ 4.64625991298817098141e-05F, /* 0x3842e0d6 */
+ 1.05703387816902250051e-04F, /* 0x38ddad13 */
+ 1.17213814519345760345e-04F, /* 0x38f5d0b0 */
+ 8.17377731436863541603e-05F, /* 0x38ab6aa2 */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.16847433673683553934e-04F, /* 0x38f50bfd */
+ 1.88827965757809579372e-04F, /* 0x3946001f */
+ 2.16612941585481166840e-04F, /* 0x39632298 */
+ 2.00857131858356297016e-04F, /* 0x39529d2d */
+ 1.42199307447299361229e-04F, /* 0x39151b56 */
+ 4.12627305195201188326e-05F, /* 0x382d1185 */
+ 1.42796401632949709892e-04F, /* 0x3915bb9e */
+ 2.03253570361994206905e-04F, /* 0x39552077 */
+ 2.23214170546270906925e-04F, /* 0x396a0e99 */
+ 2.03244591830298304558e-04F, /* 0x39551e0e */
+ 1.43898156238719820976e-04F, /* 0x3916e35e */
+ 4.57155256299301981926e-05F, /* 0x383fbeac */
+ 1.53365719597786664963e-04F, /* 0x3920d0cc */
+ 2.23224633373320102692e-04F, /* 0x396a1168 */
+ 1.16566716314991936088e-05F, /* 0x37439106 */
+ 7.43694272387074306607e-06F, /* 0x36f98ada */
+ 2.11048507480882108212e-04F, /* 0x395d4ce7 */
+ 1.34682719362899661064e-04F, /* 0x390d399e */
+ 2.29425968427676707506e-05F, /* 0x37c074da */
+ 1.20421340398024767637e-04F, /* 0x38fc8ab7 */
+ 1.83421318070031702518e-04F, /* 0x394054c9 */
+ 2.12376224226318299770e-04F, /* 0x395eb14f */
+ 2.07710763788782060146e-04F, /* 0x3959ccef */
+ 1.69840845046564936638e-04F, /* 0x3932174e */
+ 9.91739216260612010956e-05F, /* 0x38cffb98 */
+ 2.40249748458154499531e-04F, /* 0x397beb8d */
+ 1.05178231024183332920e-04F, /* 0x38dc9322 */
+ 1.82623916771262884140e-04F, /* 0x393f7ebc */
+ 2.28821940254420042038e-04F, /* 0x396fefec */
+ 0.00000000000000000000e+00F}; /* 0x00000000 */
+
+
+/* Handle special arguments first */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & (~SIGNBIT_SP32);
+
+ if(ax >= 0x7f800000)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ /* x is NaN */
+ return x + x; /* Raise invalid if it is a signalling NaN */
+    else if (ux & SIGNBIT_SP32)
+      /* x is negative infinity */
+      return nanf_with_flags(AMD_F_INVALID);
+ else
+ /* x is positive infinity */
+ return x;
+ }
+ else if (ux & SIGNBIT_SP32)
+ {
+ /* x is negative. */
+ if (x == 0.0F)
+ /* Handle negative zero first */
+ return x;
+ else
+ return nanf_with_flags(AMD_F_INVALID);
+ }
+ else if (ux <= 0x007fffff)
+ {
+ /* x is denormalised or zero */
+ if (ux == 0)
+ /* x is zero */
+ return x;
+ else
+ {
+ /* x is denormalised; scale it up */
+ /* Normalize x by increasing the exponent by 26
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const float corr = 7.888609052210118054e-31F; /* 0x0d800000 */
+ denorm = 1;
+ GET_BITS_SP32(x, ux);
+ PUT_BITS_SP32(ux | 0x0d800000, x);
+ x -= corr;
+ GET_BITS_SP32(x, ux);
+ }
+ }
+
+ /* Main algorithm */
+
+ /*
+ Find y and e such that x = 2^e * y, where y in [1,4).
+ This is done using an in-lined variant of splitFloat,
+ which also ensures that e is even.
+ */
+ y = x;
+ ux &= EXPBITS_SP32;
+ ux >>= EXPSHIFTBITS_SP32;
+ if (ux & 1)
+ {
+ GET_BITS_SP32(y, u);
+ u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+ u |= ONEEXPBITS_SP32;
+ PUT_BITS_SP32(u, y);
+ e = ux - EXPBIAS_SP32;
+ }
+ else
+ {
+ GET_BITS_SP32(y, u);
+ u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+ u |= TWOEXPBITS_SP32;
+ PUT_BITS_SP32(u, y);
+ e = ux - EXPBIAS_SP32 - 1;
+ }
+
+ /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+ index = (int)(32.0F*y+0.5);
+
+ /* Look up the table values and compute c and r = c/t */
+
+ rtc_lead = rt_jby32_lead_table_float[index-32];
+ rtc_trail = rt_jby32_trail_table_float[index-32];
+ c = 0.03125F*index;
+ r = (y - c)/c;
+
+ /*
+ Find q = sqrt(1+r) - 1.
+ From one step of Newton on (q+1)^2 = 1+r
+ */
+
+ p = r*0.5F - r*r*(0.1250079870F - r*(0.6250522999e-01F));
+ twop = p + p;
+ q = p - (p*p + (twop - r))/(twop + 2.0);
+
+ /* Reconstruction */
+
+ rtc = rtc_lead + rtc_trail;
+ e >>= 1; /* e = e/2 */
+ z = rtc_lead + (rtc*q+rtc_trail);
+
+ if (denorm)
+ {
+ /* Scale by 2**(e-13) */
+ PUT_BITS_SP32(((e - 13) + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+ z *= r;
+ }
+ else
+ {
+ /* Scale by 2**e */
+ PUT_BITS_SP32((e + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+ z *= r;
+ }
+
+ return z;
+
+}
+#endif /* SQRTF_AMD_INLINE */
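+
+/* Illustrative sketch (not part of the original AMD source): the
+   even-exponent decomposition the sqrtf kernel above relies on.  Any
+   positive normal x can be written as x = 2^e * y with e even and y in
+   [1,4), so sqrt(x) = 2^(e/2) * sqrt(y) and the exponent halves exactly.
+   The helper name split_even_exponent is invented for this sketch, which
+   is kept disabled. */
+#if 0
+#include <math.h>
+#include <stdio.h>
+
+static void split_even_exponent(double x, int *e, double *y)
+{
+  int k;
+  double m = frexp(x, &k);                  /* x = m * 2^k, 0.5 <= m < 1 */
+  if (k & 1) { *e = k - 1; *y = 2.0 * m; }  /* y in [1,2) */
+  else       { *e = k - 2; *y = 4.0 * m; }  /* y in [2,4) */
+}
+
+int main(void)
+{
+  double x = 0.875, y;
+  int e;
+  split_even_exponent(x, &e, &y);
+  printf("2^(e/2)*sqrt(y) = %.17g, sqrt(x) = %.17g\n",
+         ldexp(sqrt(y), e / 2), sqrt(x));
+  return 0;
+}
+#endif /* illustrative sketch */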
+
+#ifdef USE_LOG_KERNEL_AMD
+static inline void log_kernel_amd64(double x, unsigned long long ux, int *xexp, double *r1, double *r2)
+{
+
+ int expadjust;
+ double r, z1, z2, correction, f, f1, f2, q, u, v, poly;
+ int index;
+
+ /*
+ Computes natural log(x). Algorithm based on:
+ Ping-Tak Peter Tang
+ "Table-driven implementation of the logarithm function in IEEE
+ floating-point arithmetic"
+ ACM Transactions on Mathematical Software (TOMS)
+ Volume 16, Issue 4 (December 1990)
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+   and ln_tail_table contains a further 53 bits of precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ /* Approximating polynomial coefficients for x near 1.0 */
+ static const double
+ ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */
+ ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */
+ ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */
+ ca_4 = 4.34887777707614552256e-04; /* 0x3f3c8034c85dfff0 */
+
+ /* Approximating polynomial coefficients for other x */
+ static const double
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */
+ cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */
+
+ static const unsigned long long
+ log_thresh1 = 0x3fee0faa00000000,
+ log_thresh2 = 0x3ff1082c00000000;
+
+ /* log_thresh1 = 9.39412117004394531250e-1 = 0x3fee0faa00000000
+ log_thresh2 = 1.06449508666992187500 = 0x3ff1082c00000000 */
+ if (ux >= log_thresh1 && ux <= log_thresh2)
+ {
+ /* Arguments close to 1.0 are handled separately to maintain
+ accuracy.
+
+ The approximation in this region exploits the identity
+         log( 1 + r ) = log( 1 + u/2 ) - log( 1 - u/2 ), where
+ u = 2r / (2+r).
+ Note that the right hand side has an odd Taylor series expansion
+ which converges much faster than the Taylor series expansion of
+ log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by
+ u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1).
+
+ One subtlety is that since u cannot be calculated from
+ r exactly, the rounding error in the first u should be
+ avoided if possible. To accomplish this, we observe that
+ u = r - r*r/(2+r).
+ Since x (=1+r) is the input argument, and thus presumed exact,
+ the formula above approximates u accurately because
+ u = r - correction,
+ and the magnitude of "correction" (of the order of r*r)
+ is small.
+ With these observations, we will approximate log( 1 + r ) by
+ r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ).
+
+ We approximate log(1+r) by an odd polynomial in u, where
+ u = 2r/(2+r) = r - r*r/(2+r).
+ */
+ r = x - 1.0;
+ u = r / (2.0 + r);
+ correction = r * u;
+ u = u + u;
+ v = u * u;
+ z1 = r;
+ z2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ *r1 = z1;
+ *r2 = z2;
+ *xexp = 0;
+ }
+ else
+ {
+ /*
+ First, we decompose the argument x to the form
+ x = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+      in U, where U = 2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(x) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
+
+ f = x;
+ if (ux < IMPBIT_DP64)
+ {
+ /* The input argument x is denormalized */
+ /* Normalize f by increasing the exponent by 60
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */
+ GET_BITS_DP64(f, ux);
+ ux |= 0x03d0000000000000;
+ PUT_BITS_DP64(ux, f);
+ f -= corr;
+ GET_BITS_DP64(f, ux);
+ expadjust = 60;
+ }
+ else
+ expadjust = 0;
+
+ /* Store the exponent of x in xexp and put
+ f into the range [0.5,1) */
+ *xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64 - expadjust;
+ PUT_BITS_DP64((ux & MANTBITS_DP64) | HALFEXPBITS_DP64, f);
+
+ /* Now x = 2**xexp * f, 1/2 <= f < 1. */
+
+ /* Set index to be the nearest integer to 128*f */
+ r = 128.0 * f;
+ index = (int)(r + 0.5);
+
+ z1 = ln_lead_table[index-64];
+ q = ln_tail_table[index-64];
+ f1 = index * 0.0078125; /* 0.0078125 = 1/128 */
+ f2 = f - f1;
+ /* At this point, x = 2**xexp * ( f1 + f2 ) where
+ f1 = j/128, j = 64, 65, ..., 128 and |f2| <= 1/256. */
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ /* u = f2 / (f1 + 0.5 * f2); */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * (cb_2 + v * cb_3)));
+ z2 = q + (u + u * poly);
+ *r1 = z1;
+ *r2 = z2;
+ }
+ return;
+}
+#endif /* USE_LOG_KERNEL_AMD */
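+
+/* Illustrative sketch (not part of the original AMD source): how a caller
+   reassembles log(x) from the kernel's outputs, following the relation used
+   by its callers in this patch, log(x) = xexp*log(2) + r1 + r2.  log2_lead
+   and log2_tail are the same split-log(2) constants used by acosh.c below;
+   the wrapper name log_via_kernel is invented here and the sketch is kept
+   disabled. */
+#if 0
+static double log_via_kernel(double x)
+{
+  static const double
+    log2_lead = 6.93147122859954833984e-01,  /* 0x3fe62e42e0000000 */
+    log2_tail = 5.76999904754328540596e-08;  /* 0x3e6efa39ef35793c */
+  unsigned long long ux;
+  double r1, r2;
+  int xexp;
+  GET_BITS_DP64(x, ux);
+  log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+  /* Add the exponent contribution in two pieces so that the leading part
+     stays free of rounding error. */
+  r1 = xexp * log2_lead + r1;
+  r2 = xexp * log2_tail + r2;
+  return r1 + r2;
+}
+#endif /* illustrative sketch */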
+
+#if defined(USE_REMAINDER_PIBY2F_INLINE)
+/* Define this to get debugging print statements activated */
+#define DEBUGGING_PRINT
+#undef DEBUGGING_PRINT
+
+
+#ifdef DEBUGGING_PRINT
+#include <stdio.h>
+char *d2b(long long d, int bitsper, int point)
+{
+ static char buff[200];
+ int i, j;
+ j = bitsper;
+ if (point >= 0 && point <= bitsper)
+ j++;
+ buff[j] = '\0';
+ for (i = bitsper - 1; i >= 0; i--)
+ {
+ j--;
+ if (d % 2 == 1)
+ buff[j] = '1';
+ else
+ buff[j] = '0';
+ if (i == point)
+ {
+ j--;
+ buff[j] = '.';
+ }
+ d /= 2;
+ }
+ return buff;
+}
+#endif
+
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
+static inline void __remainder_piby2f_inline(unsigned long long ux, double *r, int *region)
+{
+
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+#define bitsper 36
+ unsigned long long res[10];
+ unsigned long long u, carry, mask, mant, nextbits;
+ int first, last, i, rexp, xexp, resexp, ltb, determ, bc;
+ double dx;
+ static const double
+ piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+#ifdef WINDOWS
+ static unsigned long long pibits[] =
+ {
+ 0LL,
+ 5215LL, 13000023176LL, 11362338026LL, 67174558139LL,
+ 34819822259LL, 10612056195LL, 67816420731LL, 57840157550LL,
+ 19558516809LL, 50025467026LL, 25186875954LL, 18152700886LL
+ };
+#else
+ static unsigned long long pibits[] =
+ {
+ 0L,
+ 5215L, 13000023176L, 11362338026L, 67174558139L,
+ 34819822259L, 10612056195L, 67816420731L, 57840157550L,
+ 19558516809L, 50025467026L, 25186875954L, 18152700886L
+ };
+#endif
+
+#ifdef DEBUGGING_PRINT
+ printf("On entry, x = %25.20e = %s\n", x, double2hex(&x));
+#endif
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = ((ux & MANTBITS_DP64) | IMPBIT_DP64) >> 29;
+
+#ifdef DEBUGGING_PRINT
+ printf("ux = %s\n", d2b(ux, 64, -1));
+#endif
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 120 is the theoretical maximum number of bits (actually
+ 115 for IEEE single precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 120 / bitsper;
+
+#ifdef DEBUGGING_PRINT
+ printf("first = %d, last = %d\n", first, last);
+#endif
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 36. */
+ res[4] = 0;
+ u = pibits[last] * ux;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last - 1] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last - 2] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[first] * ux + carry;
+ res[0] = u & mask;
+
+#ifdef DEBUGGING_PRINT
+ printf("resexp = %d\n", resexp);
+ printf("Significant part of x * 2/pi with binary"
+ " point in correct place:\n");
+ for (i = 0; i <= last - first; i++)
+ {
+ if (i > 0 && i % 5 == 0)
+ printf("\n ");
+ if (i == 1)
+ printf("%s ", d2b(res[i], bitsper, resexp));
+ else
+ printf("%s ", d2b(res[i], bitsper, -1));
+ }
+ printf("\n");
+#endif
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+#ifdef DEBUGGING_PRINT
+ printf("ltb = %d (last two bits before binary point"
+ " and first bit after)\n", ltb);
+ printf("determ = %d (1 means need to negate because the fractional\n"
+ " part of x * 2/pi is greater than 0.5)\n", determ);
+#endif
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0000000000010000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ nextbits = (~(res[i+1]) & mask);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0000000000010000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ nextbits = res[i+1];
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf("First bits of mant = %s\n", d2b(mant, bitsper, -1));
+#endif
+
+ /* Normalize the mantissa. The shift value 6 here, determined by
+ trial and error, seems to give optimal speed. */
+ bc = 0;
+ while (mant < 0x0000400000000000LL)
+ {
+ bc += 6;
+ mant <<= 6;
+ }
+ while (mant < 0x0010000000000000LL)
+ {
+ bc++;
+ mant <<= 1;
+ }
+ mant |= nextbits >> (bitsper - bc);
+
+ rexp = 52 + resexp - bc - i * bitsper;
+
+#ifdef DEBUGGING_PRINT
+ printf("Normalised mantissa = 0x%016lx\n", mant);
+ printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp);
+#endif
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, dx);
+
+#ifdef DEBUGGING_PRINT
+ printf("(x*2/pi) = %25.20e = %s\n", dx, double2hex(&dx));
+#endif
+
+ /* x is a double precision version of the fractional part of
+ x * 2 / pi. Multiply x by pi/2 in double precision
+ to get the reduced argument r. */
+ *r = dx * piby2;
+
+#ifdef DEBUGGING_PRINT
+ printf(" r = frac(x*2/pi) * pi/2:\n");
+ printf(" r = %25.20e = %s\n", *r, double2hex(r));
+ printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n",
+ *region);
+#endif
+}
+#endif /* USE_REMAINDER_PIBY2F_INLINE */
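+
+/* Illustrative sketch (not part of the original AMD source): the reduction
+   above leaves x congruent to r + region*pi/2 (mod 2*pi) with r in
+   [-pi/4,pi/4], so a sine can be recovered by dispatching on region.  The
+   check below uses the standard libm sin/cos purely for comparison; it is
+   not how the library's trigonometric routines consume the reduced
+   argument.  Kept disabled. */
+#if 0
+#include <math.h>
+
+static double sin_from_reduced(double r, int region)
+{
+  switch (region & 3)
+  {
+  case 0:  return  sin(r);   /* x ~ r          */
+  case 1:  return  cos(r);   /* x ~ r + pi/2   */
+  case 2:  return -sin(r);   /* x ~ r + pi     */
+  default: return -cos(r);   /* x ~ r + 3*pi/2 */
+  }
+}
+#endif /* illustrative sketch */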
+
+#if defined(WINDOWS)
+#if defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF)
+#include <errno.h>
+#endif
+
+#if defined(USE_HANDLE_ERROR)
+/* Define the Microsoft specific error handling routines */
+static __declspec(noinline) double handle_error(const char *name,
+ unsigned long long value,
+ int type, int flags, int error,
+ double arg1, double arg2)
+{
+ double z;
+ struct _exception exception_data;
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+ exception_data.arg1 = arg1;
+ exception_data.arg2 = arg2;
+ PUT_BITS_DP64(value, z);
+ exception_data.retval = z;
+ raise_fpsw_flags(flags);
+ if (!_matherr(&exception_data))
+ {
+ errno = error;
+ }
+ return exception_data.retval;
+}
+#endif /* USE_HANDLE_ERROR */
+
+#if defined(USE_HANDLE_ERRORF)
+static __declspec(noinline) float handle_errorf(const char *name,
+ unsigned int value,
+ int type, int flags, int error,
+ float arg1, float arg2)
+{
+ float z;
+ struct _exception exception_data;
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+ exception_data.arg1 = (double)arg1;
+ exception_data.arg2 = (double)arg2;
+ PUT_BITS_SP32(value, z);
+ exception_data.retval = z;
+ raise_fpsw_flags(flags);
+ if (!_matherr(&exception_data))
+ {
+ errno = error;
+ }
+ return (float)exception_data.retval;
+}
+#endif /* USE_HANDLE_ERRORF */
+#endif /* WINDOWS */
+
+#endif /* LIBM_INLINES_AMD_H_INCLUDED */
diff --git a/inc/libm_special.h b/inc/libm_special.h
new file mode 100644
index 0000000..0833b7b
--- /dev/null
+++ b/inc/libm_special.h
@@ -0,0 +1,84 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef __LIBM_SPECIAL_H__
+#define __LIBM_SPECIAL_H__
+
+// exception status set
+#define MXCSR_ES_INEXACT 0x00000020
+#define MXCSR_ES_UNDERFLOW 0x00000010
+#define MXCSR_ES_OVERFLOW 0x00000008
+#define MXCSR_ES_DIVBYZERO 0x00000004
+#define MXCSR_ES_INVALID 0x00000001
+
+void __amd_handle_errorf(int type, int error, const char *name,
+ float arg1, unsigned int arg1_is_snan,
+ float arg2, unsigned int arg2_is_snan,
+ float retval, unsigned int retval_is_snan);
+
+void __amd_handle_error(int type, int error, const char *name,
+ double arg1,
+ double arg2,
+ double retval);
+
+/* Code from GRTE/v4 math.h */
+/* Types of exceptions in the `type' field. */
+#ifndef DOMAIN
+struct exception
+ {
+ int type;
+ char *name;
+ double arg1;
+ double arg2;
+ double retval;
+ };
+
+extern int matherr (struct exception *__exc);
+
+# define X_TLOSS 1.41484755040568800000e+16
+
+/* Types of exceptions in the `type' field. */
+# define DOMAIN 1
+# define SING 2
+# define OVERFLOW 3
+# define UNDERFLOW 4
+# define TLOSS 5
+# define PLOSS 6
+
+/* SVID mode specifies returning this large value instead of infinity. */
+# define HUGE 3.40282347e+38F
+
+/* Use this define to enable a (dummy) definition of matherr(). */
+#define NEED_FAKE_MATHERR
+
+#else /* !SVID */
+
+# ifdef __USE_XOPEN
+/* X/Open wants another strange constant. */
+# define MAXFLOAT 3.40282347e+38F
+# endif
+
+#endif /* DOMAIN */
+/* Code from GRTE/v4 math.h */
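+
+/* Illustrative sketch (not part of the original header): a minimal SVID-style
+   matherr override of the kind the error paths in this library consult.
+   Returning non-zero tells the caller the error was handled, so no message
+   is printed and errno is left alone; exc->retval becomes the math
+   function's return value.  Kept disabled. */
+#if 0
+int matherr(struct exception *exc)
+{
+  if (exc->type == DOMAIN)
+  {
+    exc->retval = 0.0;  /* substitute a caller-chosen result */
+    return 1;           /* handled: suppress message and errno */
+  }
+  return 0;             /* not handled: default behaviour applies */
+}
+#endif /* illustrative sketch */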
+
+#endif // __LIBM_SPECIAL_H__
diff --git a/inc/libm_util_amd.h b/inc/libm_util_amd.h
new file mode 100644
index 0000000..f7347d0
--- /dev/null
+++ b/inc/libm_util_amd.h
@@ -0,0 +1,195 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_UTIL_AMD_H_INCLUDED
+#define LIBM_UTIL_AMD_H_INCLUDED 1
+
+
+
+
+
+
+typedef float F32;
+typedef unsigned int U32;
+typedef int S32;
+
+typedef double F64;
+typedef unsigned long long U64;
+typedef long long S64;
+
+union UT32_
+{
+ F32 f32;
+ U32 u32;
+};
+
+union UT64_
+{
+ F64 f64;
+ U64 u64;
+
+ F32 f32[2];
+ U32 u32[2];
+};
+
+typedef union UT32_ UT32;
+typedef union UT64_ UT64;
+
+
+
+
+#define QNAN_MASK_32 0x00400000
+#define QNAN_MASK_64 0x0008000000000000
+
+
+#define MULTIPLIER_SP 24
+#define MULTIPLIER_DP 53
+
+static const double VAL_2PMULTIPLIER_DP = 9007199254740992.0;
+static const double VAL_2PMMULTIPLIER_DP = 1.1102230246251565404236316680908e-16;
+static const float VAL_2PMULTIPLIER_SP = 16777216.0F;
+static const float VAL_2PMMULTIPLIER_SP = 5.9604645e-8F;
+
+
+
+
+
+/* Definitions for double functions on 64 bit machines */
+#define SIGNBIT_DP64 0x8000000000000000
+#define EXPBITS_DP64 0x7ff0000000000000
+#define MANTBITS_DP64 0x000fffffffffffff
+#define ONEEXPBITS_DP64 0x3ff0000000000000
+#define TWOEXPBITS_DP64 0x4000000000000000
+#define HALFEXPBITS_DP64 0x3fe0000000000000
+#define IMPBIT_DP64 0x0010000000000000
+#define QNANBITPATT_DP64 0x7ff8000000000000
+#define INDEFBITPATT_DP64 0xfff8000000000000
+#define PINFBITPATT_DP64 0x7ff0000000000000
+#define NINFBITPATT_DP64 0xfff0000000000000
+#define EXPBIAS_DP64 1023
+#define EXPSHIFTBITS_DP64 52
+#define BIASEDEMIN_DP64 1
+#define EMIN_DP64 -1022
+#define BIASEDEMAX_DP64 2046
+#define EMAX_DP64 1023
+#define LAMBDA_DP64 1.0e300
+#define MANTLENGTH_DP64 53
+#define BASEDIGITS_DP64 15
+
+
+/* These definitions, used by float functions,
+ are for both 32 and 64 bit machines */
+#define SIGNBIT_SP32 0x80000000
+#define EXPBITS_SP32 0x7f800000
+#define MANTBITS_SP32 0x007fffff
+#define ONEEXPBITS_SP32 0x3f800000
+#define TWOEXPBITS_SP32 0x40000000
+#define HALFEXPBITS_SP32 0x3f000000
+#define IMPBIT_SP32 0x00800000
+#define QNANBITPATT_SP32 0x7fc00000
+#define INDEFBITPATT_SP32 0xffc00000
+#define PINFBITPATT_SP32 0x7f800000
+#define NINFBITPATT_SP32 0xff800000
+#define EXPBIAS_SP32 127
+#define EXPSHIFTBITS_SP32 23
+#define BIASEDEMIN_SP32 1
+#define EMIN_SP32 -126
+#define BIASEDEMAX_SP32 254
+#define EMAX_SP32 127
+#define LAMBDA_SP32 1.0e30
+#define MANTLENGTH_SP32 24
+#define BASEDIGITS_SP32 7
+
+#define CLASS_SIGNALLING_NAN 1
+#define CLASS_QUIET_NAN 2
+#define CLASS_NEGATIVE_INFINITY 3
+#define CLASS_NEGATIVE_NORMAL_NONZERO 4
+#define CLASS_NEGATIVE_DENORMAL 5
+#define CLASS_NEGATIVE_ZERO 6
+#define CLASS_POSITIVE_ZERO 7
+#define CLASS_POSITIVE_DENORMAL 8
+#define CLASS_POSITIVE_NORMAL_NONZERO 9
+#define CLASS_POSITIVE_INFINITY 10
+
+#define OLD_BITS_SP32(x) (*((unsigned int *)&x))
+#define OLD_BITS_DP64(x) (*((unsigned long long *)&x))
+
+/* Alternatives to the above functions which don't have
+ problems when using high optimization levels on gcc */
+#define GET_BITS_SP32(x, ux) \
+ { \
+ volatile union {float f; unsigned int i;} _bitsy; \
+ _bitsy.f = (x); \
+ ux = _bitsy.i; \
+ }
+#define PUT_BITS_SP32(ux, x) \
+ { \
+ volatile union {float f; unsigned int i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.f; \
+ }
+
+#define GET_BITS_DP64(x, ux) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.d = (x); \
+ ux = _bitsy.i; \
+ }
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
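+
+/* Usage sketch (not part of the original header): reading the raw bit
+   pattern of a double with GET_BITS_DP64, isolating its biased exponent
+   field, and rebuilding the value with PUT_BITS_DP64.  The function name
+   is invented here; kept disabled. */
+#if 0
+static int example_biased_exponent(double x)
+{
+  unsigned long long ux;
+  double y;
+  GET_BITS_DP64(x, ux);
+  PUT_BITS_DP64(ux, y);  /* y is bit-for-bit identical to x */
+  (void)y;
+  return (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);  /* 1023 for x = 1.0 */
+}
+#endif /* usage sketch */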
+
+
+/* Processor-dependent floating-point status flags */
+#define AMD_F_INEXACT 0x00000020
+#define AMD_F_UNDERFLOW 0x00000010
+#define AMD_F_OVERFLOW 0x00000008
+#define AMD_F_DIVBYZERO 0x00000004
+#define AMD_F_INVALID 0x00000001
+
+/* Processor-dependent floating-point precision-control flags */
+#define AMD_F_EXTENDED 0x00000300
+#define AMD_F_DOUBLE 0x00000200
+#define AMD_F_SINGLE 0x00000000
+
+/* Processor-dependent floating-point rounding-control flags */
+#define AMD_F_RC_NEAREST 0x00000000
+#define AMD_F_RC_DOWN 0x00002000
+#define AMD_F_RC_UP 0x00004000
+#define AMD_F_RC_ZERO 0x00006000
+
+/* How to get hold of an assembly square root instruction:
+ * ASMSQRT(x,y) computes y = sqrt(x).
+ */
+#ifdef WINDOWS
+/* VC++ intrinsic call */
+#define ASMSQRT(x,y) _mm_store_sd(&y, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&x)));
+#else
+/* Hammer sqrt instruction */
+#define ASMSQRT(x,y) asm volatile ("sqrtsd %1, %0" : "=x" (y) : "x" (x));
+#endif
+
+#endif /* LIBM_UTIL_AMD_H_INCLUDED */
diff --git a/libacml.h b/libacml.h
new file mode 100644
index 0000000..92c2ccb
--- /dev/null
+++ b/libacml.h
@@ -0,0 +1,76 @@
+// Copyright 2010 and onwards Google Inc.
+// Author: Martin Thuresson
+//
+// Expose fast k8 implementation of math functions with the prefix
+// "acml_". Currently acml_log(), acml_exp(), and acml_pow() have been
+// shown to have significantly better performance than glibc libm
+// and at least as good precision.
+// https://wiki.corp.google.com/twiki/bin/view/Main/CompilerMathOptimization
+//
+// When built with --cpu=piii, acml_* will call the pure libm functions,
+// avoiding the need to special-case the calls.
+//
+// TODO(martint): Update glibc to match the libacml performance.
+
+#ifndef THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+#define THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+
+#ifndef USE_LIBACML_IMPLEMENTATION
+#define USE_LIBACML_IMPLEMENTATION defined(__x86_64__)
+#endif
+
+#if USE_LIBACML_IMPLEMENTATION
+#include "third_party/open64_libacml_mv/inc/fn_macros.h"
+#else
+#include <math.h>
+#endif
+
+extern "C" {
+
+#if USE_LIBACML_IMPLEMENTATION
+// The k8 implementation of the math functions.
+#define acml_exp_k8 FN_PROTOTYPE(exp)
+#define acml_expf_k8 FN_PROTOTYPE(expf)
+#define acml_log_k8 FN_PROTOTYPE(log)
+#define acml_pow_k8 FN_PROTOTYPE(pow)
+double acml_exp_k8(double x);
+float acml_expf_k8(float x);
+double acml_log_k8(double x);
+double acml_pow_k8(double x, double y);
+#endif
+
+static inline double acml_exp(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_exp_k8(x);
+#else
+ return exp(x);
+#endif
+}
+
+static inline float acml_expf(float x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_expf_k8(x);
+#else
+ return expf(x);
+#endif
+}
+
+static inline double acml_log(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_log_k8(x);
+#else
+ return log(x);
+#endif
+}
+
+static inline double acml_pow(double x, double y) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_pow_k8(x, y);
+#else
+ return pow(x, y);
+#endif
+}
+
+}
+
+#endif // THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
diff --git a/libacml_portability_test.cc b/libacml_portability_test.cc
new file mode 100644
index 0000000..1f62d1a
--- /dev/null
+++ b/libacml_portability_test.cc
@@ -0,0 +1,16 @@
+#include "testing/base/public/gmock.h"
+#include "testing/base/public/gunit.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+namespace {
+
+using ::testing::Eq;
+
+TEST(LibacmlPortabilityTest, Trivial) {
+ EXPECT_THAT(acml_exp(0), Eq(1));
+ EXPECT_THAT(acml_expf(0), Eq(1));
+ EXPECT_THAT(acml_pow(2, 2), Eq(4));
+ EXPECT_THAT(acml_log(1), Eq(0));
+}
+
+} // namespace
diff --git a/src/acos.c b/src/acos.c
new file mode 100644
index 0000000..26bac6c
--- /dev/null
+++ b/src/acos.c
@@ -0,0 +1,183 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.name = (char *)"acos";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acos: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(acos)
+#endif
+
+double FN_PROTOTYPE(acos)(double x)
+{
+ /* Computes arccos(x).
+ The argument is first reduced by noting that arccos(x)
+ is invalid for abs(x) > 1. For denormal and small
+ arguments arccos(x) = pi/2 to machine accuracy.
+ Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arccos(x) = pi/2 - arcsin(x)
+ = pi/2 - (x + x^3*R(x^2))
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+         arccos(x) = 2*arcsin(sqrt((1-x)/2))
+      (and arccos(x) = pi - arccos(-x) for negative x)
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const double
+ pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */
+
+ double u, y, s=0.0, r;
+ int xexp, xnan, transform=0;
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux & SIGNBIT_DP64);
+ xnan = (aux > PINFBITPATT_DP64);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_error("acos", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -56)
+ { /* y small enough that arccos(x) = pi/2 */
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ }
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0)
+ return 0.0;
+ else if (x == -1.0)
+ return val_with_flags(pi, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_error("acos", INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5*(1.0 - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u = r*(0.227485835556935010735943483075 +
+ (-0.445017216867635649900123110649 +
+ (0.275558175256937652532686256258 +
+ (-0.0549989809235685841612020091328 +
+ (0.00109242697235074662306043804220 +
+ 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/
+ (1.36491501334161032038194214209 +
+ (-3.28431505720958658909889444194 +
+ (2.76568859157270989520376345954 +
+ (-0.943639137032492685763471240072 +
+ 0.105869422087204370341222318533*r)*r)*r)*r);
+
+ if (transform)
+ { /* Reconstruct acos carefully in transformed region */
+ if (xneg) return pi - 2.0*(s+(y*u - piby2_tail));
+ else
+ {
+ double c, s1;
+ unsigned long long us;
+ GET_BITS_DP64(s, us);
+ PUT_BITS_DP64(0xffffffff00000000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ return 2.0*s1 + (2.0*c+2.0*y*u);
+ }
+ }
+ else
+ return piby2_head - (x - (piby2_tail - x*u));
+}
+
+weak_alias (__acos, acos)
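+
+/* Illustrative check (not part of the original AMD source): the transformed
+   region above rests on the half-angle identity
+   acos(x) = 2*arcsin(sqrt((1-x)/2)), with acos(x) = pi - acos(-x) used for
+   negative x.  The sketch below recomputes acos that way with the standard
+   libm asin and sqrt, purely for comparison with the careful reconstruction
+   above.  The function name is invented here; kept disabled. */
+#if 0
+#include <math.h>
+
+static double acos_via_half_angle(double x)
+{
+  static const double pi_ref = 3.14159265358979323846;
+  double ax = (x < 0.0) ? -x : x;
+  double s = sqrt(0.5 * (1.0 - ax));
+  return (x < 0.0) ? pi_ref - 2.0 * asin(s) : 2.0 * asin(s);
+}
+#endif /* illustrative check */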
diff --git a/src/acosf.c b/src/acosf.c
new file mode 100644
index 0000000..4464661
--- /dev/null
+++ b/src/acosf.c
@@ -0,0 +1,181 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.name = (char *)"acosf";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acosf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(acosf)
+#endif
+
+float FN_PROTOTYPE(acosf)(float x)
+{
+ /* Computes arccos(x).
+ The argument is first reduced by noting that arccos(x)
+ is invalid for abs(x) > 1. For denormal and small
+ arguments arccos(x) = pi/2 to machine accuracy.
+ Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arccos(x) = pi/2 - arcsin(x)
+ = pi/2 - (x + x^3*R(x^2))
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+         arccos(x) = 2*arcsin(sqrt((1-x)/2))
+      (and arccos(x) = pi - arccos(-x) for negative x)
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const float
+ piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */
+ static const double
+ pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */
+ piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */
+
+ float u, y, s = 0.0F, r;
+ int xexp, xnan, transform = 0;
+
+ unsigned int ux, aux, xneg;
+
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ xneg = (ux & SIGNBIT_SP32);
+ xnan = (aux > PINFBITPATT_SP32);
+ xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_errorf("acosf", ux|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -26)
+ /* y small enough that arccos(x) = pi/2 */
+ return valf_with_flags(piby2, AMD_F_INEXACT);
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0F)
+ return 0.0F;
+ else if (x == -1.0F)
+ return valf_with_flags((float)pi, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_errorf("acosf", INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5F*(1.0F - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u=r*(0.184161606965100694821398249421F +
+ (-0.0565298683201845211985026327361F +
+ (-0.0133819288943925804214011424456F -
+ 0.00396137437848476485201154797087F*r)*r)*r)/
+ (1.10496961524520294485512696706F -
+ 0.836411276854206731913362287293F*r);
+
+ if (transform)
+ {
+ /* Reconstruct acos carefully in transformed region */
+ if (xneg)
+ return (float)(pi - 2.0*(s+(y*u - piby2_tail)));
+ else
+ {
+ float c, s1;
+ unsigned int us;
+ GET_BITS_SP32(s, us);
+ PUT_BITS_SP32(0xffff0000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ return 2.0F*s1 + (2.0F*c+2.0F*y*u);
+ }
+ }
+ else
+ return (float)(piby2_head - (x - (piby2_tail - x*u)));
+}
+
+weak_alias (__acosf, acosf)
diff --git a/src/acosh.c b/src/acosh.c
new file mode 100644
index 0000000..f1d62c6
--- /dev/null
+++ b/src/acosh.c
@@ -0,0 +1,447 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#define USE_LOG_KERNEL_AMD
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+#undef USE_LOG_KERNEL_AMD
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"acosh";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acosh: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "acosh"
+double FN_PROTOTYPE(acosh)(double x)
+{
+
+ unsigned long long ux;
+ double r, rarg, r1, r2;
+ int xexp;
+
+ static const unsigned long long
+ recrteps = 0x4196a09e667f3bcd; /* 1/sqrt(eps) = 9.49062656242515593767e+07 */
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+
+ GET_BITS_DP64(x, ux);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_DP64)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ /* Return positive infinity with no signal */
+ return x;
+ }
+ }
+ else if ((ux & SIGNBIT_DP64) || (ux <= 0x3ff0000000000000))
+ {
+ /* x <= 1.0 */
+ if (ux == 0x3ff0000000000000)
+ {
+ /* x = 1.0; return zero. */
+ return 0.0;
+ }
+ else
+ {
+ /* x is less than 1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ }
+
+
+ if (ux > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by acosh(x) = ln(2) + ln(x) */
+      /* log_kernel_amd64(x) returns xexp, r1, r2 such that
+ log(x) = xexp*log(2) + r1 + r2 */
+ log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+      /* Add (xexp+1) * log(2) to r1,r2 to get the result acosh(x).
+ The computed r1 is not subject to rounding error because
+ (xexp+1) has at most 10 significant bits, log(2) has 24 significant
+ bits, and r1 has up to 24 bits; and the exponents of r1
+ and r2 differ by at most 6. */
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ return r1 + r2;
+ }
+ else if (ux >= 0x4060000000000000)
+ {
+ /* 128.0 <= x <= 1/sqrt(epsilon) */
+ /* acosh for these arguments is approximated by
+ acosh(x) = ln(x + sqrt(x*x-1)) */
+ rarg = x*x-1.0;
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += x;
+ GET_BITS_DP64(r, ux);
+ log_kernel_amd64(r, ux, &xexp, &r1, &r2);
+ r1 = (xexp * log2_lead + r1);
+ r2 = (xexp * log2_tail + r2);
+ return r1 + r2;
+ }
+ else
+ {
+ /* 1.0 < x <= 128.0 */
+ double u1, u2, v1, v2, w1, w2, hx, tx, t, r, s, p1, p2, a1, a2, c1, c2,
+ poly;
+ if (ux >= 0x3ff8000000000000)
+ {
+ /* 1.5 <= x <= 128.0 */
+ /* We use minimax polynomials,
+ based on Abramowitz and Stegun 4.6.32 series
+ expansion for acosh(x), with the log(2x) and 1/(2.2.x^2)
+ terms removed. We compensate for these two terms later.
+ */
+ t = x*x;
+ if (ux >= 0x4040000000000000)
+ {
+ /* [3,2] for 32.0 <= x <= 128.0 */
+ poly =
+ (0.45995704464157438175e-9 +
+ (-0.89080839823528631030e-9 +
+ (-0.10370522395596168095e-27 +
+ 0.35255386405811106347e-32 * t) * t) * t) /
+ (0.21941191335882074014e-8 +
+ (-0.10185073058358334569e-7 +
+ 0.95019562478430648685e-8 * t) * t);
+ }
+ else if (ux >= 0x4020000000000000)
+ {
+ /* [3,3] for 8.0 <= x <= 32.0 */
+ poly =
+ (-0.54903656589072526589e-10 +
+ (0.27646792387218569776e-9 +
+ (-0.26912957240626571979e-9 -
+ 0.86712268396736384286e-29 * t) * t) * t) /
+ (-0.24327683788655520643e-9 +
+ (0.20633757212593175571e-8 +
+ (-0.45438330985257552677e-8 +
+ 0.28707154390001678580e-8 * t) * t) * t);
+ }
+ else if (ux >= 0x4010000000000000)
+ {
+ /* [4,3] for 4.0 <= x <= 8.0 */
+ poly =
+ (-0.20827370596738166108e-6 +
+ (0.10232136919220422622e-5 +
+ (-0.98094503424623656701e-6 +
+ (-0.11615338819596146799e-18 +
+ 0.44511847799282297160e-21 * t) * t) * t) * t) /
+ (-0.92579451630913718588e-6 +
+ (0.76997374707496606639e-5 +
+ (-0.16727286999128481170e-4 +
+ 0.10463413698762590251e-4 * t) * t) * t);
+ }
+ else if (ux >= 0x4000000000000000)
+ {
+ /* [5,5] for 2.0 <= x <= 4.0 */
+ poly =
+ (-0.122195030526902362060e-7 +
+ (0.157894522814328933143e-6 +
+ (-0.579951798420930466109e-6 +
+ (0.803568881125803647331e-6 +
+ (-0.373906657221148667374e-6 -
+ 0.317856399083678204443e-21 * t) * t) * t) * t) * t) /
+ (-0.516260096352477148831e-7 +
+ (0.894662592315345689981e-6 +
+ (-0.475662774453078218581e-5 +
+ (0.107249291567405130310e-4 +
+ (-0.107871445525891289759e-4 +
+ 0.398833767702587224253e-5 * t) * t) * t) * t) * t);
+ }
+ else if (ux >= 0x3ffc000000000000)
+ {
+ /* [5,4] for 1.75 <= x <= 2.0 */
+ poly =
+ (0.1437926821253825186e-3 +
+ (-0.1034078230246627213e-2 +
+ (0.2015310005461823437e-2 +
+ (-0.1159685218876828075e-2 +
+ (-0.9267353551307245327e-11 +
+ 0.2880267770324388034e-12 * t) * t) * t) * t) * t) /
+ (0.6305521447028109891e-3 +
+ (-0.6816525887775002944e-2 +
+ (0.2228081831550003651e-1 +
+ (-0.2836886105406603318e-1 +
+ 0.1236997707206036752e-1 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,4] for 1.5 <= x <= 1.75 */
+ poly =
+ ( 0.7471936607751750826e-3 +
+ (-0.4849405284371905506e-2 +
+ (0.8823068059778393019e-2 +
+ (-0.4825395461288629075e-2 +
+ (-0.1001984320956564344e-8 +
+ 0.4299919281586749374e-10 * t) * t) * t) * t) * t) /
+ (0.3322359141239411478e-2 +
+ (-0.3293525930397077675e-1 +
+ (0.1011351440424239210e0 +
+ (-0.1227083591622587079e0 +
+ 0.5147099404383426080e-1 * t) * t) * t) * t);
+ }
+ GET_BITS_DP64(x, ux);
+ log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ /* Now (r1,r2) sum to log(2x). Subtract the term
+ 1/(2.2.x^2) = 0.25/t, and add poly/t, carefully
+ to maintain precision. (Note that we add poly/t
+ rather than poly because of the *x factor used
+ when generating the minimax polynomial) */
+ v2 = (poly-0.25)/t;
+ r = v2 + r1;
+ s = ((r1 - r) + v2) + r2;
+ v1 = r + s;
+ return v1 + ((r - v1) + s);
+ }
+
+ /* Here 1.0 <= x <= 1.5. It is hard to maintain accuracy here so
+ we have to go to great lengths to do so. */
+
+ /* We compute the value
+ t = x - 1.0 + sqrt(2.0*(x - 1.0) + (x - 1.0)*(x - 1.0))
+ using simulated quad precision. */
+ t = x - 1.0;
+ u1 = t * 2.0;
+
+ /* dekker_mul12(t,t,&v1,&v2); */
+ GET_BITS_DP64(t, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, hx);
+ tx = t - hx;
+ v1 = t * t;
+ v2 = (((hx * hx - v1) + hx * tx) + tx * hx) + tx * tx;
+
+ /* dekker_add2(u1,0.0,v1,v2,&w1,&w2); */
+ r = u1 + v1;
+ s = (((u1 - r) + v1) + v2);
+ w1 = r + s;
+ w2 = (r - w1) + s;
+
+ /* dekker_sqrt2(w1,w2,&u1,&u2); */
+ ASMSQRT(w1,p1);
+ GET_BITS_DP64(p1, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, c1);
+ c2 = p1 - c1;
+ a1 = p1 * p1;
+ a2 = (((c1 * c1 - a1) + c1 * c2) + c2 * c1) + c2 * c2;
+ p2 = (((w1 - a1) - a2) + w2) * 0.5 / p1;
+ u1 = p1 + p2;
+ u2 = (p1 - u1) + p2;
+
+ /* dekker_add2(u1,u2,t,0.0,&v1,&v2); */
+ r = u1 + t;
+ s = (((u1 - r) + t)) + u2;
+ r1 = r + s;
+ r2 = (r - r1) + s;
+ t = r1 + r2;
+
+ /* Check for x close to 1.0. */
+ if (x < 1.13)
+ {
+ /* Here 1.0 <= x < 1.13 implies r <= 0.656. In this region
+ we need to take extra care to maintain precision.
+ We have t = r1 + r2 = (x - 1.0 + sqrt(x*x-1.0))
+ to more than basic precision. We use the Taylor series
+ for log(1+x), with terms after the O(x*x) term
+ approximated by a [6,6] minimax polynomial. */
+ double b1, b2, c1, c2, e1, e2, q1, q2, c, cc, hr1, tr1, hpoly, tpoly, hq1, tq1, hr2, tr2;
+ poly =
+ (0.30893760556597282162e-21 +
+ (0.10513858797132174471e0 +
+ (0.27834538302122012381e0 +
+ (0.27223638654807468186e0 +
+ (0.12038958198848174570e0 +
+ (0.23357202004546870613e-1 +
+ (0.15208417992520237648e-2 +
+ 0.72741030690878441996e-7 * t) * t) * t) * t) * t) * t) * t) /
+ (0.31541576391396523486e0 +
+ (0.10715979719991342022e1 +
+ (0.14311581802952004012e1 +
+ (0.94928647994421895988e0 +
+ (0.32396235926176348977e0 +
+ (0.52566134756985833588e-1 +
+ 0.30477895574211444963e-2 * t) * t) * t) * t) * t) * t);
+
+ /* Now we can compute the result r = acosh(x) = log1p(t)
+ using the formula t - 0.5*t*t + poly*t*t. Since t is
+ represented as r1+r2, the formula becomes
+ r = r1+r2 - 0.5*(r1+r2)*(r1+r2) + poly*(r1+r2)*(r1+r2).
+ Expanding out, we get
+ r = r1 + r2 - (0.5 + poly)*(r1*r1 + 2*r1*r2 + r2*r2)
+ and ignoring negligible quantities we get
+ r = r1 + r2 - 0.5*r1*r1 + r1*r2 + poly*t*t
+ */
+ if (x < 1.06)
+ {
+ double b, c, e;
+ b = r1*r2;
+ c = 0.5*r1*r1;
+ e = poly*t*t;
+ /* N.B. the order of additions and subtractions is important */
+ r = (((r2 - b) + e) - c) + r1;
+ return r;
+ }
+ else
+ {
+ /* For 1.06 <= x <= 1.13 we must evaluate in extended precision
+ to reach about 1 ulp accuracy (in this range the simple code
+ above only manages about 1.5 ulp accuracy) */
+
+ /* Split poly, r1 and r2 into head and tail sections */
+ GET_BITS_DP64(poly, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hpoly);
+ tpoly = poly - hpoly;
+ GET_BITS_DP64(r1,ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hr1);
+ tr1 = r1 - hr1;
+ GET_BITS_DP64(r2, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hr2);
+ tr2 = r2 - hr2;
+
+ /* e = poly*t*t */
+ c = poly * r1;
+ cc = (((hpoly * hr1 - c) + hpoly * tr1) + tpoly * hr1) + tpoly * tr1;
+ cc = poly * r2 + cc;
+ q1 = c + cc;
+ q2 = (c - q1) + cc;
+ GET_BITS_DP64(q1, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hq1);
+ tq1 = q1 - hq1;
+ c = q1 * r1;
+ cc = (((hq1 * hr1 - c) + hq1 * tr1) + tq1 * hr1) + tq1 * tr1;
+ cc = q1 * r2 + q2 * r1 + cc;
+ e1 = c + cc;
+ e2 = (c - e1) + cc;
+
+ /* b = r1*r2 */
+ b1 = r1 * r2;
+ b2 = (((hr1 * hr2 - b1) + hr1 * tr2) + tr1 * hr2) + tr1 * tr2;
+
+ /* c = 0.5*r1*r1 */
+ c1 = (0.5*r1) * r1;
+ c2 = (((0.5*hr1 * hr1 - c1) + 0.5*hr1 * tr1) + 0.5*tr1 * hr1) + 0.5*tr1 * tr1;
+
+ /* v = a + d - b */
+ r = r1 - b1;
+ s = (((r1 - r) - b1) - b2) + r2;
+ v1 = r + s;
+ v2 = (r - v1) + s;
+
+ /* w = (a + d - b) - c */
+ r = v1 - c1;
+ s = (((v1 - r) - c1) - c2) + v2;
+ w1 = r + s;
+ w2 = (r - w1) + s;
+
+ /* u = ((a + d - b) - c) + e */
+ r = w1 + e1;
+ s = (((w1 - r) + e1) + e2) + w2;
+ u1 = r + s;
+ u2 = (r - u1) + s;
+
+ /* The result r = acosh(x) */
+ r = u1 + u2;
+
+ return r;
+ }
+ }
+ else
+ {
+ /* For arguments 1.13 <= x <= 1.5 the log1p function
+ is good enough */
+ return FN_PROTOTYPE(log1p)(t);
+ }
+ }
+}
+
+weak_alias (__acosh, acosh)
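+
+/* Illustrative sketch (not part of the original AMD source): the
+   "dekker_mul12" step above forms a product to double-double precision.
+   Each factor is truncated to its top 26 significant bits by masking off
+   the low 27 bits of the significand, so the partial products below are
+   exact and their combination recovers (to within a negligible term) the
+   rounding error of a*b.  With a == b == t this reproduces the v1/v2 pair
+   formed above.  The function name is invented here; kept disabled. */
+#if 0
+static void dekker_mul12_sketch(double a, double b, double *hi, double *lo)
+{
+  unsigned long long u;
+  double ha, ta, hb, tb, p;
+  GET_BITS_DP64(a, u);
+  u &= 0xfffffffff8000000;   /* high part of a: top 26 significand bits */
+  PUT_BITS_DP64(u, ha);
+  ta = a - ha;
+  GET_BITS_DP64(b, u);
+  u &= 0xfffffffff8000000;   /* high part of b: top 26 significand bits */
+  PUT_BITS_DP64(u, hb);
+  tb = b - hb;
+  p = a * b;
+  *hi = p;
+  *lo = (((ha * hb - p) + ha * tb) + ta * hb) + ta * tb;
+}
+#endif /* illustrative sketch */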
diff --git a/src/acoshf.c b/src/acoshf.c
new file mode 100644
index 0000000..c96fdb0
--- /dev/null
+++ b/src/acoshf.c
@@ -0,0 +1,149 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"acoshf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acoshf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "acoshf"
+float FN_PROTOTYPE(acoshf)(float x)
+{
+
+ unsigned int ux;
+ double dx, r, rarg, t;
+
+ static const unsigned int
+ recrteps = 0x46000000; /* 1/sqrt(eps) = 4.09600000000000000000e+03 */
+
+ static const double
+ log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */
+
+ GET_BITS_SP32(x, ux);
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_SP32)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ /* Return positive infinity with no signal */
+ return x;
+ }
+ }
+ else if ((ux & SIGNBIT_SP32) || (ux < 0x3f800000))
+ {
+ /* x is less than 1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ dx = x;
+
+ if (ux > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by acoshf(x) = ln(2) + ln(x) */
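+      /* For large x, sqrt(x*x-1) ~= x, so
+         ln(x + sqrt(x*x-1)) ~= ln(2*x) = ln(2) + ln(x). */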
+ r = FN_PROTOTYPE(log)(dx) + log2;
+ }
+ else if (ux > 0x40000000)
+ {
+ /* 2.0 <= x <= 1/sqrt(epsilon) */
+ /* acoshf for these arguments is approximated by
+ acoshf(x) = ln(x + sqrt(x*x-1)) */
+ rarg = dx*dx-1.0;
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ rarg = r + dx;
+ r = FN_PROTOTYPE(log)(rarg);
+ }
+ else
+ {
+ /* sqrt(epsilon) <= x <= 2.0 */
+ t = dx - 1.0;
+ rarg = 2.0*t + t*t;
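+      /* Since x*x - 1 = (x-1)*(x+1) = t*(2+t), we have
+         x + sqrt(x*x-1) = 1 + (t + sqrt(2.0*t + t*t)), so
+         acoshf(x) = log1p(t + sqrt(2.0*t + t*t)); forming 2.0*t + t*t
+         from t avoids the cancellation in x*x - 1 for x close to 1. */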
+ ASMSQRT(rarg,r); /* r = sqrt(rarg) */
+ rarg = t + r;
+ r = FN_PROTOTYPE(log1p)(rarg);
+ }
+ return (float)(r);
+}
+
+weak_alias (__acoshf, acoshf)
diff --git a/src/asin.c b/src/asin.c
new file mode 100644
index 0000000..0314dd8
--- /dev/null
+++ b/src/asin.c
@@ -0,0 +1,196 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"asin";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("asin: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(asin)
+#endif
+
+double FN_PROTOTYPE(asin)(double x)
+{
+ /* Computes arcsin(x).
+ The argument is first reduced by noting that arcsin(x)
+ is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x).
+ For denormal and small arguments arcsin(x) = x to machine
+ accuracy. Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arcsin(x) = x + x^3*R(x^2)
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+       arcsin(x) = pi/2 - 2*arcsin(sqrt((1-x)/2))
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const double
+ piby2_tail = 6.1232339957367660e-17, /* 0x3c91a62633145c07 */
+ hpiby2_head = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ piby2 = 1.5707963267948965e+00; /* 0x3ff921fb54442d18 */
+ double u, v, y, s=0.0, r;
+ int xexp, xnan, transform=0;
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux & SIGNBIT_DP64);
+ xnan = (aux > PINFBITPATT_DP64);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_error("asin", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -28)
+ { /* y small enough that arcsin(x) = x */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0)
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ else if (x == -1.0)
+ return val_with_flags(-piby2, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_error("asin", INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5*(1.0 - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u = r*(0.227485835556935010735943483075 +
+ (-0.445017216867635649900123110649 +
+ (0.275558175256937652532686256258 +
+ (-0.0549989809235685841612020091328 +
+ (0.00109242697235074662306043804220 +
+ 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/
+ (1.36491501334161032038194214209 +
+ (-3.28431505720958658909889444194 +
+ (2.76568859157270989520376345954 +
+ (-0.943639137032492685763471240072 +
+ 0.105869422087204370341222318533*r)*r)*r)*r);
+
+ if (transform)
+ { /* Reconstruct asin carefully in transformed region */
+ {
+ double c, s1, p, q;
+ unsigned long long us;
+ GET_BITS_DP64(s, us);
+ PUT_BITS_DP64(0xffffffff00000000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ p = 2.0*s*u - (piby2_tail-2.0*c);
+ q = hpiby2_head - 2.0*s1;
+ v = hpiby2_head - (p-q);
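+      /* Here s1 is s with its low 32 bits cleared, so s1*s1 is exact and
+         c = (r-s1*s1)/(s+s1) ~= s - s1 recovers the tail of sqrt(r).
+         Since s*u ~= arcsin(s) - s, the lines above evaluate
+         v = 2*hpiby2_head + piby2_tail - 2*(s1+c) - 2*s*u
+           ~= pi/2 - 2*arcsin(s) = arcsin(y),
+         keeping the head and tail of pi/2 separate until the end. */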
+ }
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Use a temporary variable to prevent VC++ rearranging
+ y + y*u
+ into
+ y * (1 + u)
+ and getting an incorrectly rounded result */
+ double tmp;
+ tmp = y * u;
+ v = y + tmp;
+#else
+ v = y + y*u;
+#endif
+ }
+
+ if (xneg) return -v;
+ else return v;
+}
+
+weak_alias (__asin, asin)
diff --git a/src/asinf.c b/src/asinf.c
new file mode 100644
index 0000000..4b42b01
--- /dev/null
+++ b/src/asinf.c
@@ -0,0 +1,190 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"asinf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("asinf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(asinf)
+#endif
+
+float FN_PROTOTYPE(asinf)(float x)
+{
+ /* Computes arcsin(x).
+ The argument is first reduced by noting that arcsin(x)
+ is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x).
+ For denormal and small arguments arcsin(x) = x to machine
+ accuracy. Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arcsin(x) = x + x^3*R(x^2)
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+       arcsin(x) = pi/2 - 2*arcsin(sqrt((1-x)/2))
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const float
+ piby2_tail = 7.5497894159e-08F, /* 0x33a22168 */
+ hpiby2_head = 7.8539812565e-01F, /* 0x3f490fda */
+ piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */
+ float u, v, y, s = 0.0F, r;
+ int xexp, xnan, transform = 0;
+
+ unsigned int ux, aux, xneg;
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ xneg = (ux & SIGNBIT_SP32);
+ xnan = (aux > PINFBITPATT_SP32);
+ xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_errorf("asinf", ux|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -14)
+ /* y small enough that arcsin(x) = x */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ else if (xexp >= 0)
+ {
+ /* abs(x) >= 1.0 */
+ if (x == 1.0F)
+ return valf_with_flags(piby2, AMD_F_INEXACT);
+ else if (x == -1.0F)
+ return valf_with_flags(-piby2, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_errorf("asinf", INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5F*(1.0F - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u=r*(0.184161606965100694821398249421F +
+ (-0.0565298683201845211985026327361F +
+ (-0.0133819288943925804214011424456F -
+ 0.00396137437848476485201154797087F*r)*r)*r)/
+ (1.10496961524520294485512696706F -
+ 0.836411276854206731913362287293F*r);
+
+ if (transform)
+ {
+ /* Reconstruct asin carefully in transformed region */
+ float c, s1, p, q;
+ unsigned int us;
+ GET_BITS_SP32(s, us);
+ PUT_BITS_SP32(0xffff0000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ p = 2.0F*s*u - (piby2_tail-2.0F*c);
+ q = hpiby2_head - 2.0F*s1;
+ v = hpiby2_head - (p-q);
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Use a temporary variable to prevent VC++ rearranging
+ y + y*u
+ into
+ y * (1 + u)
+ and getting an incorrectly rounded result */
+ float tmp;
+ tmp = y * u;
+ v = y + tmp;
+#else
+ v = y + y*u;
+#endif
+ }
+
+ if (xneg) return -v;
+ else return v;
+}
+
+weak_alias (__asinf, asinf)
diff --git a/src/asinh.c b/src/asinh.c
new file mode 100644
index 0000000..7ecde9c
--- /dev/null
+++ b/src/asinh.c
@@ -0,0 +1,322 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_HANDLE_ERROR
+#define USE_LOG_KERNEL_AMD
+#define USE_VAL_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_LOG_KERNEL_AMD
+#undef USE_VAL_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinh"
+double FN_PROTOTYPE(asinh)(double x)
+{
+
+ unsigned long long ux, ax, xneg;
+ double absx, r, rarg, t, r1, r2, poly, s, v1, v2;
+ int xexp;
+
+ static const unsigned long long
+ rteps = 0x3e46a09e667f3bcd, /* sqrt(eps) = 1.05367121277235086670e-08 */
+ recrteps = 0x4196a09e667f3bcd; /* 1/rteps = 9.49062656242515593767e+07 */
+
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+ xneg = ux & SIGNBIT_DP64;
+ PUT_BITS_DP64(ax, absx);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+ if (ux & SIGNBIT_DP64)
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+ else
+ return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x;
+#endif
+ }
+ }
+ else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Tiny arguments approximated by asinh(x) = x
+ - avoid slow operations on denormalized numbers */
+ return val_with_flags(x,AMD_F_INEXACT);
+ }
+ }
+
+
+ if (ax <= 0x3ff0000000000000) /* abs(x) <= 1.0 */
+ {
+ /* Arguments less than 1.0 in magnitude are
+ approximated by [4,4] or [5,4] minimax polynomials
+ fitted to asinh series 4.6.31 (x < 1) from Abramowitz and Stegun
+ */
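+      /* The series referred to above is, for |x| < 1,
+           asinh(x) = x - x^3/(2*3) + (1*3)*x^5/(2*4*5) - (1*3*5)*x^7/(2*4*6*7) + ...
+         so the rational fits below give poly ~= (asinh(x) - x)/x^3 and
+         the result is reconstructed as x + x*t*poly with t = x*x. */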
+ t = x*x;
+ if (ax < 0x3fd0000000000000)
+ {
+ /* [4,4] for 0 < abs(x) < 0.25 */
+ poly =
+ (-0.12845379283524906084997e0 +
+ (-0.21060688498409799700819e0 +
+ (-0.10188951822578188309186e0 +
+ (-0.13891765817243625541799e-1 -
+ 0.10324604871728082428024e-3 * t) * t) * t) * t) /
+ (0.77072275701149440164511e0 +
+ (0.16104665505597338100747e1 +
+ (0.11296034614816689554875e1 +
+ (0.30079351943799465092429e0 +
+ 0.235224464765951442265117e-1 * t) * t) * t) * t);
+ }
+ else if (ax < 0x3fe0000000000000)
+ {
+ /* [4,4] for 0.25 <= abs(x) < 0.5 */
+ poly =
+ (-0.12186605129448852495563e0 +
+ (-0.19777978436593069928318e0 +
+ (-0.94379072395062374824320e-1 +
+ (-0.12620141363821680162036e-1 -
+ 0.903396794842691998748349e-4 * t) * t) * t) * t) /
+ (0.73119630776696495279434e0 +
+ (0.15157170446881616648338e1 +
+ (0.10524909506981282725413e1 +
+ (0.27663713103600182193817e0 +
+ 0.21263492900663656707646e-1 * t) * t) * t) * t);
+ }
+ else if (ax < 0x3fe8000000000000)
+ {
+ /* [4,4] for 0.5 <= abs(x) < 0.75 */
+ poly =
+ (-0.81210026327726247622500e-1 +
+ (-0.12327355080668808750232e0 +
+ (-0.53704925162784720405664e-1 +
+ (-0.63106739048128554465450e-2 -
+ 0.35326896180771371053534e-4 * t) * t) * t) * t) /
+ (0.48726015805581794231182e0 +
+ (0.95890837357081041150936e0 +
+ (0.62322223426940387752480e0 +
+ (0.15028684818508081155141e0 +
+ 0.10302171620320141529445e-1 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,4] for 0.75 <= abs(x) <= 1.0 */
+ poly =
+ (-0.4638179204422665073e-1 +
+ (-0.7162729496035415183e-1 +
+ (-0.3247795155696775148e-1 +
+ (-0.4225785421291932164e-2 +
+ (-0.3808984717603160127e-4 +
+ 0.8023464184964125826e-6 * t) * t) * t) * t) * t) /
+ (0.2782907534642231184e0 +
+ (0.5549945896829343308e0 +
+ (0.3700732511330698879e0 +
+ (0.9395783438240780722e-1 +
+ 0.7200057974217143034e-2 * t) * t) * t) * t);
+ }
+ return x + x*t*poly;
+ }
+ else if (ax < 0x4040000000000000)
+ {
+ /* 1.0 <= abs(x) <= 32.0 */
+ /* Arguments in this region are approximated by various
+ minimax polynomials fitted to asinh series 4.6.31
+ in Abramowitz and Stegun.
+ */
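+      /* For x > 1 the series referred to above is
+           asinh(x) = ln(2*x) + 1/(2*2*x^2) - (1*3)/(2*4*4*x^4) + ...
+         The fits below approximate the tail beyond the 1/(4*x^2) term,
+         scaled by t = x*x, which is why the reconstruction further down
+         adds (poly + 0.25)/t to an extra-precise ln(2*x). */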
+ t = x*x;
+ if (ax >= 0x4020000000000000)
+ {
+ /* [3,3] for 8.0 <= abs(x) <= 32.0 */
+ poly =
+ (-0.538003743384069117e-10 +
+ (-0.273698654196756169e-9 +
+ (-0.268129826956403568e-9 -
+ 0.804163374628432850e-29 * t) * t) * t) /
+ (0.238083376363471960e-9 +
+ (0.203579344621125934e-8 +
+ (0.450836980450693209e-8 +
+ 0.286005148753497156e-8 * t) * t) * t);
+ }
+ else if (ax >= 0x4010000000000000)
+ {
+ /* [4,3] for 4.0 <= abs(x) <= 8.0 */
+ poly =
+ (-0.178284193496441400e-6 +
+ (-0.928734186616614974e-6 +
+ (-0.923318925566302615e-6 +
+ (-0.776417026702577552e-19 +
+ 0.290845644810826014e-21 * t) * t) * t) * t) /
+ (0.786694697277890964e-6 +
+ (0.685435665630965488e-5 +
+ (0.153780175436788329e-4 +
+ 0.984873520613417917e-5 * t) * t) * t);
+
+ }
+ else if (ax >= 0x4000000000000000)
+ {
+ /* [5,4] for 2.0 <= abs(x) <= 4.0 */
+ poly =
+ (-0.209689451648100728e-6 +
+ (-0.219252358028695992e-5 +
+ (-0.551641756327550939e-5 +
+ (-0.382300259826830258e-5 +
+ (-0.421182121910667329e-17 +
+ 0.492236019998237684e-19 * t) * t) * t) * t) * t) /
+ (0.889178444424237735e-6 +
+ (0.131152171690011152e-4 +
+ (0.537955850185616847e-4 +
+ (0.814966175170941864e-4 +
+ 0.407786943832260752e-4 * t) * t) * t) * t);
+ }
+ else if (ax >= 0x3ff8000000000000)
+ {
+ /* [5,4] for 1.5 <= abs(x) <= 2.0 */
+ poly =
+ (-0.195436610112717345e-4 +
+ (-0.233315515113382977e-3 +
+ (-0.645380957611087587e-3 +
+ (-0.478948863920281252e-3 +
+ (-0.805234112224091742e-12 +
+ 0.246428598194879283e-13 * t) * t) * t) * t) * t) /
+ (0.822166621698664729e-4 +
+ (0.135346265620413852e-2 +
+ (0.602739242861830658e-2 +
+ (0.972227795510722956e-2 +
+ 0.510878800983771167e-2 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,5] for 1.0 <= abs(x) <= 1.5 */
+ poly =
+ (-0.121224194072430701e-4 +
+ (-0.273145455834305218e-3 +
+ (-0.152866982560895737e-2 +
+ (-0.292231744584913045e-2 +
+ (-0.174670900236060220e-2 -
+ 0.891754209521081538e-12 * t) * t) * t) * t) * t) /
+ (0.499426632161317606e-4 +
+ (0.139591210395547054e-2 +
+ (0.107665231109108629e-1 +
+ (0.325809818749873406e-1 +
+ (0.415222526655158363e-1 +
+ 0.186315628774716763e-1 * t) * t) * t) * t) * t);
+ }
+ log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ /* Now (r1,r2) sum to log(2x). Add the term
+         1/(2*2*x^2) = 0.25/t, and add poly/t, carefully
+ to maintain precision. (Note that we add poly/t
+ rather than poly because of the *x factor used
+ when generating the minimax polynomial) */
+ v2 = (poly+0.25)/t;
+ r = v2 + r1;
+ s = ((r1 - r) + v2) + r2;
+ v1 = r + s;
+ v2 = (r - v1) + s;
+ r = v1 + v2;
+ if (xneg)
+ return -r;
+ else
+ return r;
+ }
+ else
+ {
+ /* abs(x) > 32.0 */
+ if (ax > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by asinh(x) = ln(2) + ln(abs(x)), with sign of x */
+ /* log_kernel_amd(x) returns xexp, r1, r2 such that
+ log(x) = xexp*log(2) + r1 + r2 */
+ log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+          /* Add (xexp+1) * log(2) to r1,r2 to get the result asinh(x).
+ The computed r1 is not subject to rounding error because
+ (xexp+1) has at most 10 significant bits, log(2) has 24 significant
+ bits, and r1 has up to 24 bits; and the exponents of r1
+ and r2 differ by at most 6. */
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ if (xneg)
+ return -(r1 + r2);
+ else
+ return r1 + r2;
+ }
+ else
+ {
+ rarg = absx*absx+1.0;
+ /* Arguments such that 32.0 <= abs(x) <= 1/sqrt(epsilon) are
+ approximated by
+ asinh(x) = ln(abs(x) + sqrt(x*x+1))
+ with the sign of x (see Abramowitz and Stegun 4.6.20) */
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += absx;
+ GET_BITS_DP64(r, ax);
+ log_kernel_amd64(r, ax, &xexp, &r1, &r2);
+ r1 = (xexp * log2_lead + r1);
+ r2 = (xexp * log2_tail + r2);
+ if (xneg)
+ return -(r1 + r2);
+ else
+ return r1 + r2;
+ }
+ }
+}
+
+weak_alias (__asinh, asinh)
diff --git a/src/asinhf.c b/src/asinhf.c
new file mode 100644
index 0000000..f5d3bf9
--- /dev/null
+++ b/src/asinhf.c
@@ -0,0 +1,164 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_HANDLE_ERRORF
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#undef USE_VALF_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinhf"
+float FN_PROTOTYPE(asinhf)(float x)
+{
+
+ double dx;
+ unsigned int ux, ax, xneg;
+ double absx, r, rarg, t, poly;
+
+ static const unsigned int
+ rteps = 0x39800000, /* sqrt(eps) = 2.44140625000000000000e-04 */
+    recrteps = 0x46000000; /* 2/rteps = 8.19200000000000000000e+03 */
+
+ static const double
+ log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+ xneg = ux & SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+ if (ux & SIGNBIT_SP32)
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+ else
+ return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return x;
+#endif
+ }
+ }
+ else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Tiny arguments approximated by asinhf(x) = x
+ - avoid slow operations on denormalized numbers */
+ return valf_with_flags(x,AMD_F_INEXACT);
+ }
+ }
+
+ dx = x;
+ if (xneg)
+ absx = -dx;
+ else
+ absx = dx;
+
+ if (ax <= 0x40800000) /* abs(x) <= 4.0 */
+ {
+ /* Arguments less than 4.0 in magnitude are
+ approximated by [4,4] minimax polynomials
+ */
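+      /* As in asinh, the fits below give poly ~= (asinh(x) - x)/x^3,
+         evaluated in double precision, and the result is reconstructed
+         as x + x*t*poly with t = x*x before rounding back to float. */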
+ t = dx*dx;
+ if (ax <= 0x40000000) /* abs(x) <= 2 */
+ poly =
+ (-0.1152965835871758072e-1 +
+ (-0.1480204186473758321e-1 +
+ (-0.5063201055468483248e-2 +
+ (-0.4162727710583425360e-3 -
+ 0.1177198915954942694e-5 * t) * t) * t) * t) /
+ (0.6917795026025976739e-1 +
+ (0.1199423176003939087e+0 +
+ (0.6582362487198468066e-1 +
+ (0.1260024978680227945e-1 +
+ 0.6284381367285534560e-3 * t) * t) * t) * t);
+ else
+ poly =
+ (-0.185462290695578589e-2 +
+ (-0.113672533502734019e-2 +
+ (-0.142208387300570402e-3 +
+ (-0.339546014993079977e-5 -
+ 0.151054665394480990e-8 * t) * t) * t) * t) /
+ (0.111486158580024771e-1 +
+ (0.117782437980439561e-1 +
+ (0.325903773532674833e-2 +
+ (0.255902049924065424e-3 +
+ 0.434150786948890837e-5 * t) * t) * t) * t);
+ return (float)(dx + dx*t*poly);
+ }
+ else
+ {
+ /* abs(x) > 4.0 */
+ if (ax > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by asinhf(x) = ln(2) + ln(abs(x)), with sign of x */
+ r = FN_PROTOTYPE(log)(absx) + log2;
+ }
+ else
+ {
+ rarg = absx*absx+1.0;
+ /* Arguments such that 4.0 <= abs(x) <= 1/sqrt(epsilon) are
+ approximated by
+ asinhf(x) = ln(abs(x) + sqrt(x*x+1))
+ with the sign of x (see Abramowitz and Stegun 4.6.20) */
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += absx;
+ r = FN_PROTOTYPE(log)(r);
+ }
+ if (xneg)
+ return (float)(-r);
+ else
+ return (float)r;
+ }
+}
+
+weak_alias (__asinhf, asinhf)
diff --git a/src/atan.c b/src/atan.c
new file mode 100644
index 0000000..3b99df9
--- /dev/null
+++ b/src/atan.c
@@ -0,0 +1,171 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.name = (char *)"atan";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atan: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan)
+#endif
+
+double FN_PROTOTYPE(atan)(double x)
+{
+
+ /* Some constants and split constants. */
+
+ static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */
+ double chi, clo, v, s, q, z;
+
+ /* Find properties of argument x. */
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux != aux);
+
+ if (xneg) v = -x;
+ else v = x;
+
+ /* Argument reduction to range [-7/16,7/16] */
+
+ if (aux < 0x3e50000000000000) /* v < 2.0^(-26) */
+ {
+ /* x is a good approximation to atan(x) and avoids working on
+ intermediate denormal numbers */
+ if (aux == 0x0000000000000000)
+ return x;
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x4003800000000000) /* v > 39./16. */
+ {
+
+ if (aux > PINFBITPATT_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("atan", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ }
+ else if (aux > 0x4370000000000000)
+ { /* abs(x) > 2^56 => arctan(1/x) is
+ insignificant compared to piby2 */
+ if (xneg)
+ return val_with_flags(-piby2, AMD_F_INEXACT);
+ else
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ }
+
+ x = -1.0/v;
+ /* (chi + clo) = arctan(infinity) */
+ chi = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ clo = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */
+ }
+ else if (aux > 0x3ff3000000000000) /* 39./16. > v > 19./16. */
+ {
+ x = (v-1.5)/(1.0+1.5*v);
+ /* (chi + clo) = arctan(1.5) */
+ chi = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */
+ clo = 1.39033110312309953701e-17; /* 0x3c7007887af0cbbc */
+ }
+ else if (aux > 0x3fe6000000000000) /* 19./16. > v > 11./16. */
+ {
+ x = (v-1.0)/(1.0+v);
+ /* (chi + clo) = arctan(1.) */
+ chi = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */
+ clo = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */
+ }
+ else if (aux > 0x3fdc000000000000) /* 11./16. > v > 7./16. */
+ {
+ x = (2.0*v-1.0)/(2.0+v);
+ /* (chi + clo) = arctan(0.5) */
+ chi = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */
+ clo = 2.26987774529616809294e-17; /* 0x3c7a2b7f222f65e0 */
+ }
+ else /* v < 7./16. */
+ {
+ x = v;
+ chi = 0.0;
+ clo = 0.0;
+ }
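+
+  /* In each branch above x = (v - c)/(1 + c*v) for a pivot c in
+     {0, 0.5, 1.0, 1.5} (or x = -1/v with c = infinity), so that
+     atan(v) = atan(c) + atan(x) with abs(x) <= 7/16, and (chi,clo)
+     hold atan(c) split into a head and a tail part. */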
+
+ /* Core approximation: Remez(4,4) on [-7/16,7/16] */
+
+ s = x*x;
+ q = x*s*
+ (0.268297920532545909e0 +
+ (0.447677206805497472e0 +
+ (0.220638780716667420e0 +
+ (0.304455919504853031e-1 +
+ 0.142316903342317766e-3*s)*s)*s)*s)/
+ (0.804893761597637733e0 +
+ (0.182596787737507063e1 +
+ (0.141254259931958921e1 +
+ (0.424602594203847109e0 +
+ 0.389525873944742195e-1*s)*s)*s)*s);
+
+ z = chi - ((q - clo) - x);
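+  /* q approximates x - atan(x), so z = chi + clo + (x - q)
+     ~= atan(c) + atan(x) = atan(v). */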
+
+ if (xneg) z = -z;
+ return z;
+}
+
+weak_alias (__atan, atan)
diff --git a/src/atan2.c b/src/atan2.c
new file mode 100644
index 0000000..6531ee4
--- /dev/null
+++ b/src/atan2.c
@@ -0,0 +1,785 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_SCALEUPDOUBLE1024
+#define USE_SCALEDOWNDOUBLE
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_SCALEUPDOUBLE1024
+#undef USE_SCALEDOWNDOUBLE
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range arguments
+ (only used when _LIB_VERSION is _SVID_) */
+static inline double retval_errno_edom(double x, double y)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = y;
+ exc.name = (char *)"atan2";
+ exc.type = DOMAIN;
+ exc.retval = HUGE;
+ if (!matherr(&exc))
+ {
+ (void)fputs("atan2: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan2)
+#endif
+
+double FN_PROTOTYPE(atan2)(double y, double x)
+{
+ /* Arrays atan_jby256_lead and atan_jby256_tail contain
+ leading and trailing parts respectively of precomputed
+ values of atan(j/256), for j = 16, 17, ..., 256.
+ atan_jby256_lead contains the first 21 bits of precision,
+     and atan_jby256_tail contains a further 53 bits of precision. */
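+
+  /* Together the lead and tail parts give each atan(j/256) to about
+     74 bits, so the table entries contribute essentially no rounding
+     error of their own; the tail part can be folded into the small
+     correction term before the lead part is added. */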
+
+ static const double atan_jby256_lead[ 241] = {
+ 6.24187886714935302734e-02, /* 0x3faff55b00000000 */
+ 6.63088560104370117188e-02, /* 0x3fb0f99e00000000 */
+ 7.01969265937805175781e-02, /* 0x3fb1f86d00000000 */
+ 7.40829110145568847656e-02, /* 0x3fb2f71900000000 */
+ 7.79666304588317871094e-02, /* 0x3fb3f59f00000000 */
+ 8.18479657173156738281e-02, /* 0x3fb4f3fd00000000 */
+ 8.57268571853637695312e-02, /* 0x3fb5f23200000000 */
+ 8.96031260490417480469e-02, /* 0x3fb6f03b00000000 */
+ 9.34767723083496093750e-02, /* 0x3fb7ee1800000000 */
+ 9.73475575447082519531e-02, /* 0x3fb8ebc500000000 */
+ 1.01215422153472900391e-01, /* 0x3fb9e94100000000 */
+ 1.05080246925354003906e-01, /* 0x3fbae68a00000000 */
+ 1.08941912651062011719e-01, /* 0x3fbbe39e00000000 */
+ 1.12800359725952148438e-01, /* 0x3fbce07c00000000 */
+ 1.16655409336090087891e-01, /* 0x3fbddd2100000000 */
+ 1.20507001876831054688e-01, /* 0x3fbed98c00000000 */
+ 1.24354958534240722656e-01, /* 0x3fbfd5ba00000000 */
+ 1.28199219703674316406e-01, /* 0x3fc068d500000000 */
+ 1.32039666175842285156e-01, /* 0x3fc0e6ad00000000 */
+ 1.35876297950744628906e-01, /* 0x3fc1646500000000 */
+ 1.39708757400512695312e-01, /* 0x3fc1e1fa00000000 */
+ 1.43537282943725585938e-01, /* 0x3fc25f6e00000000 */
+ 1.47361397743225097656e-01, /* 0x3fc2dcbd00000000 */
+ 1.51181221008300781250e-01, /* 0x3fc359e800000000 */
+ 1.54996633529663085938e-01, /* 0x3fc3d6ee00000000 */
+ 1.58807516098022460938e-01, /* 0x3fc453ce00000000 */
+ 1.62613749504089355469e-01, /* 0x3fc4d08700000000 */
+ 1.66415214538574218750e-01, /* 0x3fc54d1800000000 */
+ 1.70211911201477050781e-01, /* 0x3fc5c98100000000 */
+ 1.74003481864929199219e-01, /* 0x3fc645bf00000000 */
+ 1.77790164947509765625e-01, /* 0x3fc6c1d400000000 */
+ 1.81571602821350097656e-01, /* 0x3fc73dbd00000000 */
+ 1.85347914695739746094e-01, /* 0x3fc7b97b00000000 */
+ 1.89118742942810058594e-01, /* 0x3fc8350b00000000 */
+ 1.92884206771850585938e-01, /* 0x3fc8b06e00000000 */
+ 1.96644186973571777344e-01, /* 0x3fc92ba300000000 */
+ 2.00398445129394531250e-01, /* 0x3fc9a6a800000000 */
+ 2.04147100448608398438e-01, /* 0x3fca217e00000000 */
+ 2.07889914512634277344e-01, /* 0x3fca9c2300000000 */
+ 2.11626768112182617188e-01, /* 0x3fcb169600000000 */
+ 2.15357661247253417969e-01, /* 0x3fcb90d700000000 */
+ 2.19082474708557128906e-01, /* 0x3fcc0ae500000000 */
+ 2.22801089286804199219e-01, /* 0x3fcc84bf00000000 */
+ 2.26513504981994628906e-01, /* 0x3fccfe6500000000 */
+ 2.30219483375549316406e-01, /* 0x3fcd77d500000000 */
+ 2.33919143676757812500e-01, /* 0x3fcdf11000000000 */
+ 2.37612247467041015625e-01, /* 0x3fce6a1400000000 */
+ 2.41298794746398925781e-01, /* 0x3fcee2e100000000 */
+ 2.44978547096252441406e-01, /* 0x3fcf5b7500000000 */
+ 2.48651623725891113281e-01, /* 0x3fcfd3d100000000 */
+ 2.52317905426025390625e-01, /* 0x3fd025fa00000000 */
+ 2.55977153778076171875e-01, /* 0x3fd061ee00000000 */
+ 2.59629487991333007812e-01, /* 0x3fd09dc500000000 */
+ 2.63274669647216796875e-01, /* 0x3fd0d97e00000000 */
+ 2.66912937164306640625e-01, /* 0x3fd1151a00000000 */
+ 2.70543813705444335938e-01, /* 0x3fd1509700000000 */
+ 2.74167299270629882812e-01, /* 0x3fd18bf500000000 */
+ 2.77783632278442382812e-01, /* 0x3fd1c73500000000 */
+ 2.81392335891723632812e-01, /* 0x3fd2025500000000 */
+ 2.84993648529052734375e-01, /* 0x3fd23d5600000000 */
+ 2.88587331771850585938e-01, /* 0x3fd2783700000000 */
+ 2.92173147201538085938e-01, /* 0x3fd2b2f700000000 */
+ 2.95751571655273437500e-01, /* 0x3fd2ed9800000000 */
+ 2.99322128295898437500e-01, /* 0x3fd3281800000000 */
+ 3.02884817123413085938e-01, /* 0x3fd3627700000000 */
+ 3.06439399719238281250e-01, /* 0x3fd39cb400000000 */
+ 3.09986352920532226562e-01, /* 0x3fd3d6d100000000 */
+ 3.13524961471557617188e-01, /* 0x3fd410cb00000000 */
+ 3.17055702209472656250e-01, /* 0x3fd44aa400000000 */
+ 3.20578098297119140625e-01, /* 0x3fd4845a00000000 */
+ 3.24092388153076171875e-01, /* 0x3fd4bdee00000000 */
+ 3.27598333358764648438e-01, /* 0x3fd4f75f00000000 */
+ 3.31095933914184570312e-01, /* 0x3fd530ad00000000 */
+ 3.34585189819335937500e-01, /* 0x3fd569d800000000 */
+ 3.38066101074218750000e-01, /* 0x3fd5a2e000000000 */
+ 3.41538190841674804688e-01, /* 0x3fd5dbc300000000 */
+ 3.45002174377441406250e-01, /* 0x3fd6148400000000 */
+ 3.48457098007202148438e-01, /* 0x3fd64d1f00000000 */
+ 3.51903676986694335938e-01, /* 0x3fd6859700000000 */
+ 3.55341434478759765625e-01, /* 0x3fd6bdea00000000 */
+ 3.58770608901977539062e-01, /* 0x3fd6f61900000000 */
+ 3.62190723419189453125e-01, /* 0x3fd72e2200000000 */
+ 3.65602254867553710938e-01, /* 0x3fd7660700000000 */
+ 3.69004726409912109375e-01, /* 0x3fd79dc600000000 */
+ 3.72398376464843750000e-01, /* 0x3fd7d56000000000 */
+ 3.75782966613769531250e-01, /* 0x3fd80cd400000000 */
+ 3.79158496856689453125e-01, /* 0x3fd8442200000000 */
+ 3.82525205612182617188e-01, /* 0x3fd87b4b00000000 */
+ 3.85882616043090820312e-01, /* 0x3fd8b24d00000000 */
+ 3.89230966567993164062e-01, /* 0x3fd8e92900000000 */
+ 3.92570018768310546875e-01, /* 0x3fd91fde00000000 */
+ 3.95900011062622070312e-01, /* 0x3fd9566d00000000 */
+ 3.99220705032348632812e-01, /* 0x3fd98cd500000000 */
+ 4.02532100677490234375e-01, /* 0x3fd9c31600000000 */
+ 4.05834197998046875000e-01, /* 0x3fd9f93000000000 */
+ 4.09126996994018554688e-01, /* 0x3fda2f2300000000 */
+ 4.12410259246826171875e-01, /* 0x3fda64ee00000000 */
+ 4.15684223175048828125e-01, /* 0x3fda9a9200000000 */
+ 4.18948888778686523438e-01, /* 0x3fdad00f00000000 */
+ 4.22204017639160156250e-01, /* 0x3fdb056400000000 */
+ 4.25449609756469726562e-01, /* 0x3fdb3a9100000000 */
+ 4.28685665130615234375e-01, /* 0x3fdb6f9600000000 */
+ 4.31912183761596679688e-01, /* 0x3fdba47300000000 */
+ 4.35129165649414062500e-01, /* 0x3fdbd92800000000 */
+ 4.38336372375488281250e-01, /* 0x3fdc0db400000000 */
+ 4.41534280776977539062e-01, /* 0x3fdc421900000000 */
+ 4.44722414016723632812e-01, /* 0x3fdc765500000000 */
+ 4.47900772094726562500e-01, /* 0x3fdcaa6800000000 */
+ 4.51069593429565429688e-01, /* 0x3fdcde5300000000 */
+ 4.54228639602661132812e-01, /* 0x3fdd121500000000 */
+ 4.57377910614013671875e-01, /* 0x3fdd45ae00000000 */
+ 4.60517644882202148438e-01, /* 0x3fdd791f00000000 */
+ 4.63647603988647460938e-01, /* 0x3fddac6700000000 */
+ 4.66767549514770507812e-01, /* 0x3fdddf8500000000 */
+ 4.69877958297729492188e-01, /* 0x3fde127b00000000 */
+ 4.72978591918945312500e-01, /* 0x3fde454800000000 */
+ 4.76069211959838867188e-01, /* 0x3fde77eb00000000 */
+ 4.79150056838989257812e-01, /* 0x3fdeaa6500000000 */
+ 4.82221126556396484375e-01, /* 0x3fdedcb600000000 */
+ 4.85282421112060546875e-01, /* 0x3fdf0ede00000000 */
+ 4.88333940505981445312e-01, /* 0x3fdf40dd00000000 */
+ 4.91375446319580078125e-01, /* 0x3fdf72b200000000 */
+ 4.94406938552856445312e-01, /* 0x3fdfa45d00000000 */
+ 4.97428894042968750000e-01, /* 0x3fdfd5e000000000 */
+ 5.00440597534179687500e-01, /* 0x3fe0039c00000000 */
+ 5.03442764282226562500e-01, /* 0x3fe01c3400000000 */
+ 5.06434917449951171875e-01, /* 0x3fe034b700000000 */
+ 5.09417057037353515625e-01, /* 0x3fe04d2500000000 */
+ 5.12389183044433593750e-01, /* 0x3fe0657e00000000 */
+ 5.15351772308349609375e-01, /* 0x3fe07dc300000000 */
+ 5.18304347991943359375e-01, /* 0x3fe095f300000000 */
+ 5.21246910095214843750e-01, /* 0x3fe0ae0e00000000 */
+ 5.24179458618164062500e-01, /* 0x3fe0c61400000000 */
+ 5.27101993560791015625e-01, /* 0x3fe0de0500000000 */
+ 5.30014991760253906250e-01, /* 0x3fe0f5e200000000 */
+ 5.32917976379394531250e-01, /* 0x3fe10daa00000000 */
+ 5.35810947418212890625e-01, /* 0x3fe1255d00000000 */
+ 5.38693904876708984375e-01, /* 0x3fe13cfb00000000 */
+ 5.41567325592041015625e-01, /* 0x3fe1548500000000 */
+ 5.44430732727050781250e-01, /* 0x3fe16bfa00000000 */
+ 5.47284126281738281250e-01, /* 0x3fe1835a00000000 */
+ 5.50127506256103515625e-01, /* 0x3fe19aa500000000 */
+ 5.52961349487304687500e-01, /* 0x3fe1b1dc00000000 */
+ 5.55785179138183593750e-01, /* 0x3fe1c8fe00000000 */
+ 5.58598995208740234375e-01, /* 0x3fe1e00b00000000 */
+ 5.61403274536132812500e-01, /* 0x3fe1f70400000000 */
+ 5.64197540283203125000e-01, /* 0x3fe20de800000000 */
+ 5.66981792449951171875e-01, /* 0x3fe224b700000000 */
+ 5.69756031036376953125e-01, /* 0x3fe23b7100000000 */
+ 5.72520732879638671875e-01, /* 0x3fe2521700000000 */
+ 5.75275897979736328125e-01, /* 0x3fe268a900000000 */
+ 5.78021049499511718750e-01, /* 0x3fe27f2600000000 */
+ 5.80756187438964843750e-01, /* 0x3fe2958e00000000 */
+ 5.83481788635253906250e-01, /* 0x3fe2abe200000000 */
+ 5.86197376251220703125e-01, /* 0x3fe2c22100000000 */
+ 5.88903427124023437500e-01, /* 0x3fe2d84c00000000 */
+ 5.91599464416503906250e-01, /* 0x3fe2ee6200000000 */
+ 5.94285964965820312500e-01, /* 0x3fe3046400000000 */
+ 5.96962928771972656250e-01, /* 0x3fe31a5200000000 */
+ 5.99629878997802734375e-01, /* 0x3fe3302b00000000 */
+ 6.02287292480468750000e-01, /* 0x3fe345f000000000 */
+ 6.04934692382812500000e-01, /* 0x3fe35ba000000000 */
+ 6.07573032379150390625e-01, /* 0x3fe3713d00000000 */
+ 6.10201358795166015625e-01, /* 0x3fe386c500000000 */
+ 6.12820148468017578125e-01, /* 0x3fe39c3900000000 */
+ 6.15428924560546875000e-01, /* 0x3fe3b19800000000 */
+ 6.18028640747070312500e-01, /* 0x3fe3c6e400000000 */
+ 6.20618820190429687500e-01, /* 0x3fe3dc1c00000000 */
+ 6.23198986053466796875e-01, /* 0x3fe3f13f00000000 */
+ 6.25770092010498046875e-01, /* 0x3fe4064f00000000 */
+ 6.28331184387207031250e-01, /* 0x3fe41b4a00000000 */
+ 6.30883216857910156250e-01, /* 0x3fe4303200000000 */
+ 6.33425712585449218750e-01, /* 0x3fe4450600000000 */
+ 6.35958671569824218750e-01, /* 0x3fe459c600000000 */
+ 6.38482093811035156250e-01, /* 0x3fe46e7200000000 */
+ 6.40995979309082031250e-01, /* 0x3fe4830a00000000 */
+ 6.43500804901123046875e-01, /* 0x3fe4978f00000000 */
+ 6.45996093750000000000e-01, /* 0x3fe4ac0000000000 */
+ 6.48482322692871093750e-01, /* 0x3fe4c05e00000000 */
+ 6.50959014892578125000e-01, /* 0x3fe4d4a800000000 */
+ 6.53426170349121093750e-01, /* 0x3fe4e8de00000000 */
+ 6.55884265899658203125e-01, /* 0x3fe4fd0100000000 */
+ 6.58332824707031250000e-01, /* 0x3fe5111000000000 */
+ 6.60772323608398437500e-01, /* 0x3fe5250c00000000 */
+ 6.63202762603759765625e-01, /* 0x3fe538f500000000 */
+ 6.65623664855957031250e-01, /* 0x3fe54cca00000000 */
+ 6.68035984039306640625e-01, /* 0x3fe5608d00000000 */
+ 6.70438766479492187500e-01, /* 0x3fe5743c00000000 */
+ 6.72832489013671875000e-01, /* 0x3fe587d800000000 */
+ 6.75216674804687500000e-01, /* 0x3fe59b6000000000 */
+ 6.77592277526855468750e-01, /* 0x3fe5aed600000000 */
+ 6.79958820343017578125e-01, /* 0x3fe5c23900000000 */
+ 6.82316303253173828125e-01, /* 0x3fe5d58900000000 */
+ 6.84664726257324218750e-01, /* 0x3fe5e8c600000000 */
+ 6.87004089355468750000e-01, /* 0x3fe5fbf000000000 */
+ 6.89334869384765625000e-01, /* 0x3fe60f0800000000 */
+ 6.91656589508056640625e-01, /* 0x3fe6220d00000000 */
+ 6.93969249725341796875e-01, /* 0x3fe634ff00000000 */
+ 6.96272850036621093750e-01, /* 0x3fe647de00000000 */
+ 6.98567867279052734375e-01, /* 0x3fe65aab00000000 */
+ 7.00854301452636718750e-01, /* 0x3fe66d6600000000 */
+ 7.03131675720214843750e-01, /* 0x3fe6800e00000000 */
+ 7.05400466918945312500e-01, /* 0x3fe692a400000000 */
+ 7.07660198211669921875e-01, /* 0x3fe6a52700000000 */
+ 7.09911346435546875000e-01, /* 0x3fe6b79800000000 */
+ 7.12153911590576171875e-01, /* 0x3fe6c9f700000000 */
+ 7.14387893676757812500e-01, /* 0x3fe6dc4400000000 */
+ 7.16613292694091796875e-01, /* 0x3fe6ee7f00000000 */
+ 7.18829631805419921875e-01, /* 0x3fe700a700000000 */
+ 7.21037864685058593750e-01, /* 0x3fe712be00000000 */
+ 7.23237514495849609375e-01, /* 0x3fe724c300000000 */
+ 7.25428581237792968750e-01, /* 0x3fe736b600000000 */
+ 7.27611064910888671875e-01, /* 0x3fe7489700000000 */
+ 7.29785442352294921875e-01, /* 0x3fe75a6700000000 */
+ 7.31950759887695312500e-01, /* 0x3fe76c2400000000 */
+ 7.34108448028564453125e-01, /* 0x3fe77dd100000000 */
+ 7.36257076263427734375e-01, /* 0x3fe78f6b00000000 */
+ 7.38397598266601562500e-01, /* 0x3fe7a0f400000000 */
+ 7.40530014038085937500e-01, /* 0x3fe7b26c00000000 */
+ 7.42654323577880859375e-01, /* 0x3fe7c3d300000000 */
+ 7.44770050048828125000e-01, /* 0x3fe7d52800000000 */
+ 7.46877670288085937500e-01, /* 0x3fe7e66c00000000 */
+ 7.48976707458496093750e-01, /* 0x3fe7f79e00000000 */
+ 7.51068115234375000000e-01, /* 0x3fe808c000000000 */
+ 7.53150939941406250000e-01, /* 0x3fe819d000000000 */
+ 7.55226135253906250000e-01, /* 0x3fe82ad000000000 */
+ 7.57292747497558593750e-01, /* 0x3fe83bbe00000000 */
+ 7.59351730346679687500e-01, /* 0x3fe84c9c00000000 */
+ 7.61402606964111328125e-01, /* 0x3fe85d6900000000 */
+ 7.63445377349853515625e-01, /* 0x3fe86e2500000000 */
+ 7.65480041503906250000e-01, /* 0x3fe87ed000000000 */
+ 7.67507076263427734375e-01, /* 0x3fe88f6b00000000 */
+ 7.69526004791259765625e-01, /* 0x3fe89ff500000000 */
+ 7.71537303924560546875e-01, /* 0x3fe8b06f00000000 */
+ 7.73540973663330078125e-01, /* 0x3fe8c0d900000000 */
+ 7.75536537170410156250e-01, /* 0x3fe8d13200000000 */
+ 7.77523994445800781250e-01, /* 0x3fe8e17a00000000 */
+ 7.79504299163818359375e-01, /* 0x3fe8f1b300000000 */
+ 7.81476497650146484375e-01, /* 0x3fe901db00000000 */
+ 7.83441066741943359375e-01, /* 0x3fe911f300000000 */
+ 7.85398006439208984375e-01}; /* 0x3fe921fb00000000 */
+
+ static const double atan_jby256_tail[ 241] = {
+ 2.13244638182005395671e-08, /* 0x3e56e59fbd38db2c */
+ 3.89093864761712760656e-08, /* 0x3e64e3aa54dedf96 */
+ 4.44780900009437454576e-08, /* 0x3e67e105ab1bda88 */
+ 1.15344768460112754160e-08, /* 0x3e48c5254d013fd0 */
+ 3.37271051945395312705e-09, /* 0x3e2cf8ab3ad62670 */
+ 2.40857608736109859459e-08, /* 0x3e59dca4bec80468 */
+ 1.85853810450623807768e-08, /* 0x3e53f4b5ec98a8da */
+ 5.14358299969225078306e-08, /* 0x3e6b9d49619d81fe */
+ 8.85023985412952486748e-09, /* 0x3e43017887460934 */
+ 1.59425154214358432060e-08, /* 0x3e511e3eca0b9944 */
+ 1.95139937737755753164e-08, /* 0x3e54f3f73c5a332e */
+ 2.64909755273544319715e-08, /* 0x3e5c71c8ae0e00a6 */
+ 4.43388037881231070144e-08, /* 0x3e67cde0f86fbdc7 */
+ 2.14757072421821274557e-08, /* 0x3e570f328c889c72 */
+ 2.61049792670754218852e-08, /* 0x3e5c07ae9b994efe */
+ 7.81439350674466302231e-09, /* 0x3e40c8021d7b1698 */
+ 3.60125207123751024094e-08, /* 0x3e635585edb8cb22 */
+ 6.15276238179343767917e-08, /* 0x3e70842567b30e96 */
+ 9.54387964641184285058e-08, /* 0x3e799e811031472e */
+ 3.02789566851502754129e-08, /* 0x3e6041821416bcee */
+ 1.16888650949870856331e-07, /* 0x3e7f6086e4dc96f4 */
+ 1.07580956468653338863e-08, /* 0x3e471a535c5f1b58 */
+ 8.33454265379535427653e-08, /* 0x3e765f743fe63ca1 */
+ 1.10790279272629526068e-07, /* 0x3e7dbd733472d014 */
+ 1.08394277896366207424e-07, /* 0x3e7d18cc4d8b0d1d */
+ 9.22176086126841098800e-08, /* 0x3e78c12553c8fb29 */
+ 7.90938592199048786990e-08, /* 0x3e753b49e2e8f991 */
+ 8.66445407164293125637e-08, /* 0x3e77422ae148c141 */
+ 1.40839973537092438671e-08, /* 0x3e4e3ec269df56a8 */
+ 1.19070438507307600689e-07, /* 0x3e7ff6754e7e0ac9 */
+ 6.40451663051716197071e-08, /* 0x3e7131267b1b5aad */
+ 1.08338682076343674522e-07, /* 0x3e7d14fa403a94bc */
+ 3.52999550187922736222e-08, /* 0x3e62f396c089a3d8 */
+ 1.05983273930043077202e-07, /* 0x3e7c731d78fa95bb */
+ 1.05486124078259553339e-07, /* 0x3e7c50f385177399 */
+ 5.82167732281776477773e-08, /* 0x3e6f41409c6f2c20 */
+ 1.08696483983403942633e-07, /* 0x3e7d2d90c4c39ec0 */
+ 4.47335086122377542835e-08, /* 0x3e680420696f2106 */
+ 1.26896287162615723528e-08, /* 0x3e4b40327943a2e8 */
+ 4.06534471589151404531e-08, /* 0x3e65d35e02f3d2a2 */
+ 3.84504846300557026690e-08, /* 0x3e64a498288117b0 */
+ 3.60715006404807269080e-08, /* 0x3e635da119afb324 */
+ 6.44725903165522722801e-08, /* 0x3e714e85cdb9a908 */
+ 3.63749249976409461305e-08, /* 0x3e638754e5547b9a */
+ 1.03901294413833913794e-07, /* 0x3e7be40ae6ce3246 */
+ 6.25379756302167880580e-08, /* 0x3e70c993b3bea7e7 */
+ 6.63984302368488828029e-08, /* 0x3e71d2dd89ac3359 */
+ 3.21844598971548278059e-08, /* 0x3e61476603332c46 */
+ 1.16030611712765830905e-07, /* 0x3e7f25901bac55b7 */
+ 1.17464622142347730134e-07, /* 0x3e7f881b7c826e28 */
+ 7.54604017965808996596e-08, /* 0x3e7441996d698d20 */
+ 1.49234929356206556899e-07, /* 0x3e8407ac521ea089 */
+ 1.41416924523217430259e-07, /* 0x3e82fb0c6c4b1723 */
+ 2.13308065617483489011e-07, /* 0x3e8ca135966a3e18 */
+ 5.04230937933302320146e-08, /* 0x3e6b1218e4d646e4 */
+ 5.45874922281655519035e-08, /* 0x3e6d4e72a350d288 */
+ 1.51849028914786868886e-07, /* 0x3e84617e2f04c329 */
+ 3.09004308703769273010e-08, /* 0x3e6096ec41e82650 */
+ 9.67574548184738317664e-08, /* 0x3e79f91f25773e6e */
+ 4.02508285529322212824e-08, /* 0x3e659c0820f1d674 */
+ 3.01222268096861091157e-08, /* 0x3e602bf7a2df1064 */
+ 2.36189860670079288680e-07, /* 0x3e8fb36bfc40508f */
+ 1.14095158111080887695e-07, /* 0x3e7ea08f3f8dc892 */
+ 7.42349089746573467487e-08, /* 0x3e73ed6254656a0e */
+ 5.12515583196230380184e-08, /* 0x3e6b83f5e5e69c58 */
+ 2.19290391828763918102e-07, /* 0x3e8d6ec2af768592 */
+ 3.83263512187553886471e-08, /* 0x3e6493889a226f94 */
+ 1.61513486284090523855e-07, /* 0x3e85ad8fa65279ba */
+ 5.09996743535589922261e-08, /* 0x3e6b615784d45434 */
+ 1.23694037861246766534e-07, /* 0x3e809a184368f145 */
+ 8.23367955351123783984e-08, /* 0x3e761a2439b0d91c */
+ 1.07591766213053694014e-07, /* 0x3e7ce1a65e39a978 */
+ 1.42789947524631815640e-07, /* 0x3e832a39a93b6a66 */
+ 1.32347123024711878538e-07, /* 0x3e81c3699af804e7 */
+ 2.17626067316598149229e-08, /* 0x3e575e0f4e44ede8 */
+ 2.34454866923044288656e-07, /* 0x3e8f77ced1a7a83b */
+ 2.82966370261766916053e-09, /* 0x3e284e7f0cb1b500 */
+ 2.29300919890907632975e-07, /* 0x3e8ec6b838b02dfe */
+ 1.48428270450261284915e-07, /* 0x3e83ebf4dfbeda87 */
+ 1.87937408574313982512e-07, /* 0x3e89397aed9cb475 */
+ 6.13685946813334055347e-08, /* 0x3e707937bc239c54 */
+ 1.98585022733583817493e-07, /* 0x3e8aa754553131b6 */
+ 7.68394131623752961662e-08, /* 0x3e74a05d407c45dc */
+ 1.28119052312436745644e-07, /* 0x3e8132231a206dd0 */
+ 7.02119104719236502733e-08, /* 0x3e72d8ecfdd69c88 */
+ 9.87954793820636301943e-08, /* 0x3e7a852c74218606 */
+ 1.72176752381034986217e-07, /* 0x3e871bf2baeebb50 */
+ 1.12877225146169704119e-08, /* 0x3e483d7db7491820 */
+ 5.33549829555851737993e-08, /* 0x3e6ca50d92b6da14 */
+ 2.13833275710816521345e-08, /* 0x3e56f5cde8530298 */
+ 1.16243518048290556393e-07, /* 0x3e7f343198910740 */
+ 6.29926408369055877943e-08, /* 0x3e70e8d241ccd80a */
+ 6.45429039328021963791e-08, /* 0x3e71535ac619e6c8 */
+ 8.64001922814281933403e-08, /* 0x3e77316041c36cd2 */
+ 9.50767572202325800240e-08, /* 0x3e7985a000637d8e */
+ 5.80851497508121135975e-08, /* 0x3e6f2f29858c0a68 */
+ 1.82350561135024766232e-07, /* 0x3e8879847f96d909 */
+ 1.98948680587390608655e-07, /* 0x3e8ab3d319e12e42 */
+ 7.83548663450197659846e-08, /* 0x3e75088162dfc4c2 */
+ 3.04374234486798594427e-08, /* 0x3e605749a1cd9d8c */
+ 2.76135725629797411787e-08, /* 0x3e5da65c6c6b8618 */
+ 4.32610105454203065470e-08, /* 0x3e6739bf7df1ad64 */
+ 5.17107515324127256994e-08, /* 0x3e6bc31252aa3340 */
+ 2.82398327875841444660e-08, /* 0x3e5e528191ad3aa8 */
+ 1.87482469524195595399e-07, /* 0x3e8929d93df19f18 */
+ 2.97481891662714096139e-08, /* 0x3e5ff11eb693a080 */
+ 9.94421570843584316402e-09, /* 0x3e455ae3f145a3a0 */
+ 1.07056210730391848428e-07, /* 0x3e7cbcd8c6c0ca82 */
+ 6.25589580466881163081e-08, /* 0x3e70cb04d425d304 */
+ 9.56641013869464593803e-08, /* 0x3e79adfcab5be678 */
+ 1.88056307148355440276e-07, /* 0x3e893d90c5662508 */
+ 8.38850689379557880950e-08, /* 0x3e768489bd35ff40 */
+ 5.01215865527674122924e-09, /* 0x3e3586ed3da2b7e0 */
+ 1.74166095998522089762e-07, /* 0x3e87604d2e850eee */
+ 9.96779574395363585849e-08, /* 0x3e7ac1d12bfb53d8 */
+ 5.98432026368321460686e-09, /* 0x3e39b3d468274740 */
+ 1.18362922366887577169e-07, /* 0x3e7fc5d68d10e53c */
+ 1.86086833284154215946e-07, /* 0x3e88f9e51884becb */
+ 1.97671457251348941011e-07, /* 0x3e8a87f0869c06d1 */
+ 1.42447160717199237159e-07, /* 0x3e831e7279f685fa */
+ 1.05504240785546574184e-08, /* 0x3e46a8282f9719b0 */
+ 3.13335218371639189324e-08, /* 0x3e60d2724a8a44e0 */
+ 1.96518418901914535399e-07, /* 0x3e8a60524b11ad4e */
+ 2.17692035039173536059e-08, /* 0x3e575fdf832750f0 */
+ 2.15613114426529981675e-07, /* 0x3e8cf06902e4cd36 */
+ 5.68271098300441214948e-08, /* 0x3e6e82422d4f6d10 */
+ 1.70331455823369124256e-08, /* 0x3e524a091063e6c0 */
+ 9.17590028095709583247e-08, /* 0x3e78a1a172dc6f38 */
+ 2.77266304112916566247e-07, /* 0x3e929b6619f8a92d */
+ 9.37041937614656939690e-08, /* 0x3e79274d9c1b70c8 */
+ 1.56116346368316796511e-08, /* 0x3e50c34b1fbb7930 */
+ 4.13967433808382727413e-08, /* 0x3e6639866c20eb50 */
+ 1.70164749185821616276e-07, /* 0x3e86d6d0f6832e9e */
+ 4.01708788545600086008e-07, /* 0x3e9af54def99f25e */
+ 2.59663539226050551563e-07, /* 0x3e916cfc52a00262 */
+ 2.22007487655027469542e-07, /* 0x3e8dcc1e83569c32 */
+ 2.90542250809644081369e-07, /* 0x3e937f7a551ed425 */
+ 4.67720537666628903341e-07, /* 0x3e9f6360adc98887 */
+ 2.79799803956772554802e-07, /* 0x3e92c6ec8d35a2c1 */
+ 2.07344552327432547723e-07, /* 0x3e8bd44df84cb036 */
+ 2.54705698692735196368e-07, /* 0x3e9117cf826e310e */
+ 4.26848589539548450728e-07, /* 0x3e9ca533f332cfc9 */
+ 2.52506723633552216197e-07, /* 0x3e90f208509dbc2e */
+ 2.14684129933849704964e-07, /* 0x3e8cd07d93c945de */
+ 3.20134822201596505431e-07, /* 0x3e957bdfd67e6d72 */
+ 9.93537565749855712134e-08, /* 0x3e7aab89c516c658 */
+ 3.70792944827917252327e-08, /* 0x3e63e823b1a1b8a0 */
+ 1.41772749369083698972e-07, /* 0x3e8307464a9d6d3c */
+ 4.22446601490198804306e-07, /* 0x3e9c5993cd438843 */
+ 4.11818433724801511540e-07, /* 0x3e9ba2fca02ab554 */
+ 1.19976381502605310519e-07, /* 0x3e801a5b6983a268 */
+ 3.43703078571520905265e-08, /* 0x3e6273d1b350efc8 */
+ 1.66128705555453270379e-07, /* 0x3e864c238c37b0c6 */
+ 5.00499610023283006540e-08, /* 0x3e6aded07370a300 */
+ 1.75105139941208062123e-07, /* 0x3e878091197eb47e */
+ 7.70807146729030327334e-08, /* 0x3e74b0f245e0dabc */
+ 2.45918607526895836121e-07, /* 0x3e9080d9794e2eaf */
+ 2.18359020958626199345e-07, /* 0x3e8d4ec242b60c76 */
+ 8.44342887976445333569e-09, /* 0x3e4221d2f940caa0 */
+ 1.07506148687888629299e-07, /* 0x3e7cdbc42b2bba5c */
+ 5.36544954316820904572e-08, /* 0x3e6cce37bb440840 */
+ 3.39109101518396596341e-07, /* 0x3e96c1d999cf1dd0 */
+ 2.60098720293920613340e-08, /* 0x3e5bed8a07eb0870 */
+ 8.42678991664621455827e-08, /* 0x3e769ed88f490e3c */
+ 5.36972237470183633197e-08, /* 0x3e6cd41719b73ef0 */
+ 4.28192558171921681288e-07, /* 0x3e9cbc4ac95b41b7 */
+ 2.71535491483955143294e-07, /* 0x3e9238f1b890f5d7 */
+ 7.84094998145075780203e-08, /* 0x3e750c4282259cc4 */
+ 3.43880599134117431863e-07, /* 0x3e9713d2de87b3e2 */
+ 1.32878065060366481043e-07, /* 0x3e81d5a7d2255276 */
+ 4.18046802627967629428e-07, /* 0x3e9c0dfd48227ac1 */
+ 2.65042411765766019424e-07, /* 0x3e91c964dab76753 */
+ 1.70383695347518643694e-07, /* 0x3e86de56d5704496 */
+ 1.54096497259613515678e-07, /* 0x3e84aeb71fd19968 */
+ 2.36543402412459813461e-07, /* 0x3e8fbf91c57b1918 */
+ 4.38416350106876736790e-07, /* 0x3e9d6bef7fbe5d9a */
+ 3.03892161339927775731e-07, /* 0x3e9464d3dc249066 */
+ 3.31136771605664899240e-07, /* 0x3e9638e2ec4d9073 */
+ 6.49494294526590682218e-08, /* 0x3e716f4a7247ea7c */
+ 4.10423429887181345747e-09, /* 0x3e31a0a740f1d440 */
+ 1.70831640869113847224e-07, /* 0x3e86edbb0114a33c */
+ 1.10811512657909180966e-07, /* 0x3e7dbee8bf1d513c */
+ 3.23677724749783611964e-07, /* 0x3e95b8bdb0248f73 */
+ 3.55662734259192678528e-07, /* 0x3e97de3d3f5eac64 */
+ 2.30102333489738219140e-07, /* 0x3e8ee24187ae448a */
+ 4.47429004000738629714e-07, /* 0x3e9e06c591ec5192 */
+ 7.78167135617329598659e-08, /* 0x3e74e3861a332738 */
+ 9.90345291908535415737e-08, /* 0x3e7a9599dcc2bfe4 */
+ 5.85800913143113728314e-08, /* 0x3e6f732fbad43468 */
+ 4.57859062410871843857e-07, /* 0x3e9eb9f573b727d9 */
+ 3.67993069723390929794e-07, /* 0x3e98b212a2eb9897 */
+ 2.90836464322977276043e-07, /* 0x3e9384884c167215 */
+ 2.51621574250131388318e-07, /* 0x3e90e2d363020051 */
+ 2.75789824740652815545e-07, /* 0x3e92820879fbd022 */
+ 3.88985776250314403593e-07, /* 0x3e9a1ab9893e4b30 */
+ 1.40214080183768019611e-07, /* 0x3e82d1b817a24478 */
+ 3.23451432223550478373e-08, /* 0x3e615d7b8ded4878 */
+ 9.15979180730608444470e-08, /* 0x3e78968f9db3a5e4 */
+ 3.44371402498640470421e-07, /* 0x3e971c4171fe135f */
+ 3.40401897215059498077e-07, /* 0x3e96d80f605d0d8c */
+ 1.06431813453707950243e-07, /* 0x3e7c91f043691590 */
+ 1.46204238932338846248e-07, /* 0x3e839f8a15fce2b2 */
+ 9.94610376972039046878e-09, /* 0x3e455beda9d94b80 */
+ 2.01711528092681771039e-07, /* 0x3e8b12c15d60949a */
+ 2.72027977986191568296e-07, /* 0x3e924167b312bfe3 */
+ 2.48402602511693757964e-07, /* 0x3e90ab8633070277 */
+ 1.58480011219249621715e-07, /* 0x3e854554ebbc80ee */
+ 3.00372828113368713281e-08, /* 0x3e60204aef5a4bb8 */
+ 3.67816204583541976394e-07, /* 0x3e98af08c679cf2c */
+ 2.46169793032343824291e-07, /* 0x3e90852a330ae6c8 */
+ 1.70080468270204253247e-07, /* 0x3e86d3eb9ec32916 */
+ 1.67806717763872914315e-07, /* 0x3e8685cb7fcbbafe */
+ 2.67715622006907942620e-07, /* 0x3e91f751c1e0bd95 */
+ 2.14411342550299170574e-08, /* 0x3e5705b1b0f72560 */
+ 4.11228221283669073277e-07, /* 0x3e9b98d8d808ca92 */
+ 3.52311752396749662260e-08, /* 0x3e62ea22c75cc980 */
+ 3.52718000397367821054e-07, /* 0x3e97aba62bca0350 */
+ 4.38857387992911129814e-07, /* 0x3e9d73833442278c */
+ 3.22574606753482540743e-07, /* 0x3e95a5ca1fb18bf9 */
+ 3.28730371182804296828e-08, /* 0x3e61a6092b6ecf28 */
+ 7.56672470607639279700e-08, /* 0x3e744fd049aac104 */
+ 3.26750155316369681821e-09, /* 0x3e2c114fd8df5180 */
+ 3.21724445362095284743e-07, /* 0x3e95972f130feae5 */
+ 1.06639427371776571151e-07, /* 0x3e7ca034a55fe198 */
+ 3.41020788139524715063e-07, /* 0x3e96e2b149990227 */
+ 1.00582838631232552824e-07, /* 0x3e7b00000294592c */
+ 3.68439433859276640065e-07, /* 0x3e98b9bdc442620e */
+ 2.20403078342388012027e-07, /* 0x3e8d94fdfabf3e4e */
+ 1.62841467098298142534e-07, /* 0x3e85db30b145ad9a */
+ 2.25325348296680733838e-07, /* 0x3e8e3e1eb95022b0 */
+ 4.37462238226421614339e-07, /* 0x3e9d5b8b45442bd6 */
+ 3.52055880555040706500e-07, /* 0x3e97a046231ecd2e */
+ 4.75614398494781776825e-07, /* 0x3e9feafe3ef55232 */
+ 3.60998399033215317516e-07, /* 0x3e9839e7bfd78267 */
+ 3.79292434611513945954e-08, /* 0x3e645cf49d6fa900 */
+ 1.29859015528549300061e-08, /* 0x3e4be3132b27f380 */
+ 3.15927546985474913188e-07, /* 0x3e9533980bb84f9f */
+ 2.28533679887379668031e-08, /* 0x3e5889e2ce3ba390 */
+ 1.17222541823553133877e-07, /* 0x3e7f7778c3ad0cc8 */
+ 1.51991208405464415857e-07, /* 0x3e846660cec4eba2 */
+ 1.56958239325240655564e-07}; /* 0x3e85110b4611a626 */
+
+ /* Some constants and split constants. */
+
+ static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */
+ piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ three_piby4 = 2.3561944901923449e+00, /* 0x4002d97c7f3321d2 */
+ pi_head = 3.1415926218032836e+00, /* 0x400921fb50000000 */
+ pi_tail = 3.1786509547056392e-08, /* 0x3e6110b4611a6263 */
+ piby2_head = 1.5707963267948965e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.1232339957367660e-17; /* 0x3c91a62633145c07 */
+
+ double u, v, vbyu, q1, q2, s, u1, vu1, u2, vu2, uu, c, r;
+ unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf;
+ int m, xexp, yexp, diffexp;
+
+ /* Find properties of arguments x and y. */
+
+ unsigned long long ux, ui, aux, xneg, uy, auy, yneg;
+
+ GET_BITS_DP64(x, ux);
+ GET_BITS_DP64(y, uy);
+ aux = ux & ~SIGNBIT_DP64;
+ auy = uy & ~SIGNBIT_DP64;
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ xneg = ux & SIGNBIT_DP64;
+ yneg = uy & SIGNBIT_DP64;
+ xzero = (aux == 0);
+ yzero = (auy == 0);
+ xnan = (aux > PINFBITPATT_DP64);
+ ynan = (auy > PINFBITPATT_DP64);
+ xinf = (aux == PINFBITPATT_DP64);
+ yinf = (auy == PINFBITPATT_DP64);
+
+ diffexp = yexp - xexp;
+
+ /* Special cases */
+
+ if (xnan)
+#ifdef WINDOWS
+ return handle_error("atan2", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, y);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (ynan)
+#ifdef WINDOWS
+ return handle_error("atan2", uy|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, y);
+#else
+ return y + y; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (yzero)
+ { /* Zero y gives +-0 for positive x
+ and +-pi for negative x */
+#ifndef WINDOWS
+ if ((_LIB_VERSION == _SVID_) && xzero)
+ /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+ return retval_errno_edom(x, y);
+ else
+#endif
+ if (xneg)
+ {
+ if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+ else return val_with_flags(pi,AMD_F_INEXACT);
+ }
+ else return y;
+ }
+ else if (xzero)
+ { /* Zero x gives +- pi/2
+ depending on sign of y */
+ if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+      else return val_with_flags(piby2,AMD_F_INEXACT);
+ }
+
+ /* Scale up both x and y if they are both below 1/4.
+ This avoids any possible later denormalised arithmetic. */
+
+ if ((xexp < 1021 && yexp < 1021))
+ {
+ scaleUpDouble1024(ux, &ux);
+ scaleUpDouble1024(uy, &uy);
+ PUT_BITS_DP64(ux, x);
+ PUT_BITS_DP64(uy, y);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ diffexp = yexp - xexp;
+ }
+
+ if (diffexp > 56)
+ { /* abs(y)/abs(x) > 2^56 => arctan(x/y)
+ is insignificant compared to piby2 */
+ if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+ else return val_with_flags(piby2,AMD_F_INEXACT);
+ }
+ else if (diffexp < -28 && (!xneg))
+ { /* x positive and dominant over y by a factor of 2^28.
+ In this case atan(y/x) is y/x to machine accuracy. */
+
+ if (diffexp < -1074) /* Result underflows */
+ {
+ if (yneg)
+ return val_with_flags(-0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return val_with_flags(0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ }
+ else
+ {
+ if (diffexp < -1022)
+ {
+ /* Result will likely be denormalized */
+ y = scaleDouble_1(y, 100);
+ y /= x;
+ /* Now y is 2^100 times the true result. Scale it back down. */
+ GET_BITS_DP64(y, uy);
+ scaleDownDouble(uy, 100, &uy);
+ PUT_BITS_DP64(uy, y);
+ if ((uy & EXPBITS_DP64) == 0)
+ return val_with_flags(y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return y;
+ }
+ else
+ return y / x;
+ }
+ }
+ else if (diffexp < -56 && xneg)
+ { /* abs(x)/abs(y) > 2^56 and x < 0 => arctan(y/x)
+ is insignificant compared to pi */
+ if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+ else return val_with_flags(pi,AMD_F_INEXACT);
+ }
+ else if (yinf && xinf)
+ { /* If abs(x) and abs(y) are both infinity
+ return +-pi/4 or +- 3pi/4 according to
+ signs. */
+ if (xneg)
+ {
+ if (yneg) return val_with_flags(-three_piby4,AMD_F_INEXACT);
+ else return val_with_flags(three_piby4,AMD_F_INEXACT);
+ }
+ else
+ {
+ if (yneg) return val_with_flags(-piby4,AMD_F_INEXACT);
+ else return val_with_flags(piby4,AMD_F_INEXACT);
+ }
+ }
+
+ /* General case: take absolute values of arguments */
+
+ u = x; v = y;
+ if (xneg) u = -x;
+ if (yneg) v = -y;
+
+ /* Swap u and v if necessary to obtain 0 < v < u. Compute v/u. */
+
+ swap_vu = (u < v);
+ if (swap_vu) { uu = u; u = v; v = uu; }
+ vbyu = v/u;
+
+ if (vbyu > 0.0625)
+ { /* General values of v/u. Use a look-up
+ table and series expansion. */
+
+ index = (int)(256*vbyu + 0.5);
+ q1 = atan_jby256_lead[index-16];
+ q2 = atan_jby256_tail[index-16];
+ c = index*1./256;
+ GET_BITS_DP64(u, ui);
+ m = (int)((ui & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ u = scaleDouble_2(u,-m);
+ v = scaleDouble_2(v,-m);
+ GET_BITS_DP64(u, ui);
+ PUT_BITS_DP64(0xfffffffff8000000 & ui, u1); /* 26 leading bits of u */
+ u2 = u - u1;
+
+ r = ((v-c*u1)-c*u2)/(u+c*v);
+
+ /* Polynomial approximation to atan(r) */
+
+ s = r*r;
+ q2 = q2 + r - r*(s * (0.33333333333224095522 - s*(0.19999918038989143496)));
+ }
+ else if (vbyu < 1.e-8)
+ { /* v/u is small enough that atan(v/u) = v/u */
+ q1 = 0.0;
+ q2 = vbyu;
+ }
+ else /* vbyu <= 0.0625 */
+ {
+ /* Small values of v/u. Use a series expansion
+ computed carefully to minimise cancellation */
+
+ GET_BITS_DP64(u, ui);
+ PUT_BITS_DP64(0xffffffff00000000 & ui, u1);
+ GET_BITS_DP64(vbyu, ui);
+ PUT_BITS_DP64(0xffffffff00000000 & ui, vu1);
+ u2 = u - u1;
+ vu2 = vbyu - vu1;
+
+ q1 = 0.0;
+ s = vbyu*vbyu;
+ q2 = vbyu +
+ ((((v - u1*vu1) - u2*vu1) - u*vu2)/u -
+ (vbyu*s*(0.33333333333333170500 -
+ s*(0.19999999999393223405 -
+ s*(0.14285713561807169030 -
+ s*(0.11110736283514525407 -
+ s*(0.90029810285449784439E-01)))))));
+ }
+
+ /* Tidy-up according to which quadrant the arguments lie in */
+
+ if (swap_vu) {q1 = piby2_head - q1; q2 = piby2_tail - q2;}
+ if (xneg) {q1 = pi_head - q1; q2 = pi_tail - q2;}
+ q1 = q1 + q2;
+
+ if (yneg) q1 = - q1;
+
+ return q1;
+}
+
+weak_alias (__atan2, atan2)
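
The table-driven branch above relies on the identity atan(v/u) = atan(c) + atan((v - c*u)/(u + c*v)) with c = index/256; atan(c) is read from the lead/tail tables and only the small residual is handled by a short polynomial in s = r*r. A minimal standalone sketch of that reduction, not part of the patch, using atan/atan2 from <math.h> purely as a reference and an arbitrarily chosen test point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 1.0, v = 0.3;                  /* 0 < v < u and v/u > 0.0625 */
        int index = (int)(256.0 * (v / u) + 0.5); /* nearest table entry        */
        double c = index / 256.0;
        double r = (v - c * u) / (u + c * v);     /* tangent of the residual    */
        double reduced = atan(c) + atan(r);       /* table value + residual     */
        printf("%.17g %.17g\n", reduced, atan2(v, u));
        return 0;
    }

Because index is the nearest multiple of 1/256 to v/u, the residual r stays below roughly 1/512 in magnitude, which is why two correction terms suffice at double precision.
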
diff --git a/src/atan2f.c b/src/atan2f.c
new file mode 100644
index 0000000..9b53c6f
--- /dev/null
+++ b/src/atan2f.c
@@ -0,0 +1,500 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOWNDOUBLE
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOWNDOUBLE
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range arguments
+ (only used when _LIB_VERSION is _SVID_) */
+static inline float retval_errno_edom(float x, float y)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)y;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atan2f";
+ exc.retval = HUGE;
+ if (!matherr(&exc))
+ {
+ (void)fputs("atan2f: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan2f)
+#endif
+
+float FN_PROTOTYPE(atan2f)(float fy, float fx)
+{
+ /* Array atan_jby256 contains precomputed values of atan(j/256),
+ for j = 16, 17, ..., 256. */
+
+ static const double atan_jby256[ 241] = {
+ 6.24188099959573430842e-02, /* 0x3faff55bb72cfde9 */
+ 6.63088949198234745008e-02, /* 0x3fb0f99ea71d52a6 */
+ 7.01969710718705064423e-02, /* 0x3fb1f86dbf082d58 */
+ 7.40829225490337306415e-02, /* 0x3fb2f719318a4a9a */
+ 7.79666338315423007588e-02, /* 0x3fb3f59f0e7c559d */
+ 8.18479898030765457007e-02, /* 0x3fb4f3fd677292fb */
+ 8.57268757707448092464e-02, /* 0x3fb5f2324fd2d7b2 */
+ 8.96031774848717321724e-02, /* 0x3fb6f03bdcea4b0c */
+ 9.34767811585894559112e-02, /* 0x3fb7ee182602f10e */
+ 9.73475734872236708739e-02, /* 0x3fb8ebc54478fb28 */
+ 1.01215441667466668485e-01, /* 0x3fb9e94153cfdcf1 */
+ 1.05080273416329528224e-01, /* 0x3fbae68a71c722b8 */
+ 1.08941956989865793015e-01, /* 0x3fbbe39ebe6f07c3 */
+ 1.12800381201659388752e-01, /* 0x3fbce07c5c3cca32 */
+ 1.16655435441069349478e-01, /* 0x3fbddd21701eba6e */
+ 1.20507009691224548087e-01, /* 0x3fbed98c2190043a */
+ 1.24354994546761424279e-01, /* 0x3fbfd5ba9aac2f6d */
+ 1.28199281231298117811e-01, /* 0x3fc068d584212b3d */
+ 1.32039761614638734288e-01, /* 0x3fc0e6adccf40881 */
+ 1.35876328229701304195e-01, /* 0x3fc1646541060850 */
+ 1.39708874289163620386e-01, /* 0x3fc1e1fafb043726 */
+ 1.43537293701821222491e-01, /* 0x3fc25f6e171a535c */
+ 1.47361481088651630200e-01, /* 0x3fc2dcbdb2fba1ff */
+ 1.51181331798580037562e-01, /* 0x3fc359e8edeb99a3 */
+ 1.54996741923940972718e-01, /* 0x3fc3d6eee8c6626c */
+ 1.58807608315631065832e-01, /* 0x3fc453cec6092a9e */
+ 1.62613828597948567589e-01, /* 0x3fc4d087a9da4f17 */
+ 1.66415301183114927586e-01, /* 0x3fc54d18ba11570a */
+ 1.70211925285474380276e-01, /* 0x3fc5c9811e3ec269 */
+ 1.74003600935367680469e-01, /* 0x3fc645bfffb3aa73 */
+ 1.77790228992676047071e-01, /* 0x3fc6c1d4898933d8 */
+ 1.81571711160032150945e-01, /* 0x3fc73dbde8a7d201 */
+ 1.85347949995694760705e-01, /* 0x3fc7b97b4bce5b02 */
+ 1.89118848926083965578e-01, /* 0x3fc8350be398ebc7 */
+ 1.92884312257974643856e-01, /* 0x3fc8b06ee2879c28 */
+ 1.96644245190344985064e-01, /* 0x3fc92ba37d050271 */
+ 2.00398553825878511514e-01, /* 0x3fc9a6a8e96c8626 */
+ 2.04147145182116990236e-01, /* 0x3fca217e601081a5 */
+ 2.07889927202262986272e-01, /* 0x3fca9c231b403279 */
+ 2.11626808765629753628e-01, /* 0x3fcb1696574d780b */
+ 2.15357699697738047551e-01, /* 0x3fcb90d7529260a2 */
+ 2.19082510780057748701e-01, /* 0x3fcc0ae54d768466 */
+ 2.22801153759394493514e-01, /* 0x3fcc84bf8a742e6d */
+ 2.26513541356919617664e-01, /* 0x3fccfe654e1d5395 */
+ 2.30219587276843717927e-01, /* 0x3fcd77d5df205736 */
+ 2.33919206214733416127e-01, /* 0x3fcdf110864c9d9d */
+ 2.37612313865471241892e-01, /* 0x3fce6a148e96ec4d */
+ 2.41298826930858800743e-01, /* 0x3fcee2e1451d980c */
+ 2.44978663126864143473e-01, /* 0x3fcf5b75f92c80dd */
+ 2.48651741190513253521e-01, /* 0x3fcfd3d1fc40dbe4 */
+ 2.52317980886427151166e-01, /* 0x3fd025fa510665b5 */
+ 2.55977303013005474952e-01, /* 0x3fd061eea03d6290 */
+ 2.59629629408257511791e-01, /* 0x3fd09dc597d86362 */
+ 2.63274882955282396590e-01, /* 0x3fd0d97ee509acb3 */
+ 2.66912987587400396539e-01, /* 0x3fd1151a362431c9 */
+ 2.70543868292936529052e-01, /* 0x3fd150973a9ce546 */
+ 2.74167451119658789338e-01, /* 0x3fd18bf5a30bf178 */
+ 2.77783663178873208022e-01, /* 0x3fd1c735212dd883 */
+ 2.81392432649178403370e-01, /* 0x3fd2025567e47c95 */
+ 2.84993688779881237938e-01, /* 0x3fd23d562b381041 */
+ 2.88587361894077354396e-01, /* 0x3fd278372057ef45 */
+ 2.92173383391398755471e-01, /* 0x3fd2b2f7fd9b5fe2 */
+ 2.95751685750431536626e-01, /* 0x3fd2ed987a823cfe */
+ 2.99322202530807379706e-01, /* 0x3fd328184fb58951 */
+ 3.02884868374971361060e-01, /* 0x3fd362773707ebcb */
+ 3.06439619009630070945e-01, /* 0x3fd39cb4eb76157b */
+ 3.09986391246883430384e-01, /* 0x3fd3d6d129271134 */
+ 3.13525122985043869228e-01, /* 0x3fd410cbad6c7d32 */
+ 3.17055753209146973237e-01, /* 0x3fd44aa436c2af09 */
+ 3.20578221991156986359e-01, /* 0x3fd4845a84d0c21b */
+ 3.24092470489871664618e-01, /* 0x3fd4bdee586890e6 */
+ 3.27598440950530811477e-01, /* 0x3fd4f75f73869978 */
+ 3.31096076704132047386e-01, /* 0x3fd530ad9951cd49 */
+ 3.34585322166458920545e-01, /* 0x3fd569d88e1b4cd7 */
+ 3.38066122836825466713e-01, /* 0x3fd5a2e0175e0f4e */
+ 3.41538425296541714449e-01, /* 0x3fd5dbc3fbbe768d */
+ 3.45002177207105076295e-01, /* 0x3fd614840309cfe1 */
+ 3.48457327308122011278e-01, /* 0x3fd64d1ff635c1c5 */
+ 3.51903825414964732676e-01, /* 0x3fd685979f5fa6fd */
+ 3.55341622416168290144e-01, /* 0x3fd6bdeac9cbd76c */
+ 3.58770670270572189509e-01, /* 0x3fd6f61941e4def0 */
+ 3.62190922004212156882e-01, /* 0x3fd72e22d53aa2a9 */
+ 3.65602331706966821034e-01, /* 0x3fd7660752817501 */
+ 3.69004854528964421068e-01, /* 0x3fd79dc6899118d1 */
+ 3.72398446676754202311e-01, /* 0x3fd7d5604b63b3f7 */
+ 3.75783065409248884237e-01, /* 0x3fd80cd46a14b1d0 */
+ 3.79158669033441808605e-01, /* 0x3fd84422b8df95d7 */
+ 3.82525216899905096124e-01, /* 0x3fd87b4b0c1ebedb */
+ 3.85882669398073752109e-01, /* 0x3fd8b24d394a1b25 */
+ 3.89230987951320717144e-01, /* 0x3fd8e92916f5cde8 */
+ 3.92570135011828580396e-01, /* 0x3fd91fde7cd0c662 */
+ 3.95900074055262896078e-01, /* 0x3fd9566d43a34907 */
+ 3.99220769575252543149e-01, /* 0x3fd98cd5454d6b18 */
+ 4.02532187077682512832e-01, /* 0x3fd9c3165cc58107 */
+ 4.05834293074804064450e-01, /* 0x3fd9f93066168001 */
+ 4.09127055079168300278e-01, /* 0x3fda2f233e5e530b */
+ 4.12410441597387267265e-01, /* 0x3fda64eec3cc23fc */
+ 4.15684422123729413467e-01, /* 0x3fda9a92d59e98cf */
+ 4.18948967133552840902e-01, /* 0x3fdad00f5422058b */
+ 4.22204048076583571270e-01, /* 0x3fdb056420ae9343 */
+ 4.25449637370042266227e-01, /* 0x3fdb3a911da65c6c */
+ 4.28685708391625730496e-01, /* 0x3fdb6f962e737efb */
+ 4.31912235472348193799e-01, /* 0x3fdba473378624a5 */
+ 4.35129193889246812521e-01, /* 0x3fdbd9281e528191 */
+ 4.38336559857957774877e-01, /* 0x3fdc0db4c94ec9ef */
+ 4.41534310525166673322e-01, /* 0x3fdc42191ff11eb6 */
+ 4.44722423960939305942e-01, /* 0x3fdc76550aad71f8 */
+ 4.47900879150937292206e-01, /* 0x3fdcaa6872f3631b */
+ 4.51069655988523443568e-01, /* 0x3fdcde53432c1350 */
+ 4.54228735266762495559e-01, /* 0x3fdd121566b7f2ad */
+ 4.57378098670320809571e-01, /* 0x3fdd45aec9ec862b */
+ 4.60517728767271039558e-01, /* 0x3fdd791f5a1226f4 */
+ 4.63647609000806093515e-01, /* 0x3fddac670561bb4f */
+ 4.66767723680866497560e-01, /* 0x3fdddf85bb026974 */
+ 4.69878057975686880265e-01, /* 0x3fde127b6b0744af */
+ 4.72978597903265574054e-01, /* 0x3fde4548066cf51a */
+ 4.76069330322761219421e-01, /* 0x3fde77eb7f175a34 */
+ 4.79150242925822533735e-01, /* 0x3fdeaa65c7cf28c4 */
+ 4.82221324227853687105e-01, /* 0x3fdedcb6d43f8434 */
+ 4.85282563559221225002e-01, /* 0x3fdf0ede98f393cf */
+ 4.88333951056405479729e-01, /* 0x3fdf40dd0b541417 */
+ 4.91375477653101910835e-01, /* 0x3fdf72b221a4e495 */
+ 4.94407135071275316562e-01, /* 0x3fdfa45dd3029258 */
+ 4.97428915812172245392e-01, /* 0x3fdfd5e0175fdf83 */
+ 5.00440813147294050189e-01, /* 0x3fe0039c73c1a40b */
+ 5.03442821109336358099e-01, /* 0x3fe01c341e82422d */
+ 5.06434934483096732549e-01, /* 0x3fe034b709250488 */
+ 5.09417148796356245022e-01, /* 0x3fe04d25314342e5 */
+ 5.12389460310737621107e-01, /* 0x3fe0657e94db30cf */
+ 5.15351866012543347040e-01, /* 0x3fe07dc3324e9b38 */
+ 5.18304363603577900044e-01, /* 0x3fe095f30861a58f */
+ 5.21246951491958210312e-01, /* 0x3fe0ae0e1639866c */
+ 5.24179628782913242802e-01, /* 0x3fe0c6145b5b43da */
+ 5.27102395269579471204e-01, /* 0x3fe0de05d7aa6f7c */
+ 5.30015251423793132268e-01, /* 0x3fe0f5e28b67e295 */
+ 5.32918198386882147055e-01, /* 0x3fe10daa77307a0d */
+ 5.35811237960463593311e-01, /* 0x3fe1255d9bfbd2a8 */
+ 5.38694372597246617929e-01, /* 0x3fe13cfbfb1b056e */
+ 5.41567605391844897333e-01, /* 0x3fe1548596376469 */
+ 5.44430940071603086672e-01, /* 0x3fe16bfa6f5137e1 */
+ 5.47284380987436924748e-01, /* 0x3fe1835a88be7c13 */
+ 5.50127933104692989907e-01, /* 0x3fe19aa5e5299f99 */
+ 5.52961601994028217888e-01, /* 0x3fe1b1dc87904284 */
+ 5.55785393822313511514e-01, /* 0x3fe1c8fe7341f64f */
+ 5.58599315343562330405e-01, /* 0x3fe1e00babdefeb3 */
+ 5.61403373889889367732e-01, /* 0x3fe1f7043557138a */
+ 5.64197577362497537656e-01, /* 0x3fe20de813e823b1 */
+ 5.66981934222700489912e-01, /* 0x3fe224b74c1d192a */
+ 5.69756453482978431069e-01, /* 0x3fe23b71e2cc9e6a */
+ 5.72521144698072359525e-01, /* 0x3fe25217dd17e501 */
+ 5.75276017956117824426e-01, /* 0x3fe268a940696da6 */
+ 5.78021083869819540801e-01, /* 0x3fe27f261273d1b3 */
+ 5.80756353567670302596e-01, /* 0x3fe2958e59308e30 */
+ 5.83481838685214859730e-01, /* 0x3fe2abe21aded073 */
+ 5.86197551356360535557e-01, /* 0x3fe2c2215e024465 */
+ 5.88903504204738026395e-01, /* 0x3fe2d84c2961e48b */
+ 5.91599710335111383941e-01, /* 0x3fe2ee628406cbca */
+ 5.94286183324841177367e-01, /* 0x3fe30464753b090a */
+ 5.96962937215401501234e-01, /* 0x3fe31a52048874be */
+ 5.99629986503951384336e-01, /* 0x3fe3302b39b78856 */
+ 6.02287346134964152178e-01, /* 0x3fe345f01cce37bb */
+ 6.04935031491913965951e-01, /* 0x3fe35ba0b60eccce */
+ 6.07573058389022313541e-01, /* 0x3fe3713d0df6c503 */
+ 6.10201443063065118722e-01, /* 0x3fe386c52d3db11e */
+ 6.12820202165241245673e-01, /* 0x3fe39c391cd41719 */
+ 6.15429352753104952356e-01, /* 0x3fe3b198e5e2564a */
+ 6.18028912282561737612e-01, /* 0x3fe3c6e491c78dc4 */
+ 6.20618898599929469384e-01, /* 0x3fe3dc1c2a188504 */
+ 6.23199329934065904268e-01, /* 0x3fe3f13fb89e96f4 */
+ 6.25770224888563042498e-01, /* 0x3fe4064f47569f48 */
+ 6.28331602434009650615e-01, /* 0x3fe41b4ae06fea41 */
+ 6.30883481900321840818e-01, /* 0x3fe430328e4b26d5 */
+ 6.33425882969144482537e-01, /* 0x3fe445065b795b55 */
+ 6.35958825666321447834e-01, /* 0x3fe459c652badc7f */
+ 6.38482330354437466191e-01, /* 0x3fe46e727efe4715 */
+ 6.40996417725432032775e-01, /* 0x3fe4830aeb5f7bfd */
+ 6.43501108793284370968e-01, /* 0x3fe4978fa3269ee1 */
+ 6.45996424886771558604e-01, /* 0x3fe4ac00b1c71762 */
+ 6.48482387642300484032e-01, /* 0x3fe4c05e22de94e4 */
+ 6.50959018996812410762e-01, /* 0x3fe4d4a8023414e8 */
+ 6.53426341180761927063e-01, /* 0x3fe4e8de5bb6ec04 */
+ 6.55884376711170835605e-01, /* 0x3fe4fd013b7dd17e */
+ 6.58333148384755983962e-01, /* 0x3fe51110adc5ed81 */
+ 6.60772679271132590273e-01, /* 0x3fe5250cbef1e9fa */
+ 6.63202992706093175102e-01, /* 0x3fe538f57b89061e */
+ 6.65624112284960989250e-01, /* 0x3fe54ccaf0362c8f */
+ 6.68036061856020157990e-01, /* 0x3fe5608d29c70c34 */
+ 6.70438865514021320458e-01, /* 0x3fe5743c352b33b9 */
+ 6.72832547593763097282e-01, /* 0x3fe587d81f732fba */
+ 6.75217132663749830535e-01, /* 0x3fe59b60f5cfab9d */
+ 6.77592645519925151909e-01, /* 0x3fe5aed6c5909517 */
+ 6.79959111179481823228e-01, /* 0x3fe5c2399c244260 */
+ 6.82316554874748071313e-01, /* 0x3fe5d58987169b18 */
+ 6.84665002047148862907e-01, /* 0x3fe5e8c6941043cf */
+ 6.87004478341244895212e-01, /* 0x3fe5fbf0d0d5cc49 */
+ 6.89335009598845749323e-01, /* 0x3fe60f084b46e05e */
+ 6.91656621853199760075e-01, /* 0x3fe6220d115d7b8d */
+ 6.93969341323259825138e-01, /* 0x3fe634ff312d1f3b */
+ 6.96273194408023488045e-01, /* 0x3fe647deb8e20b8f */
+ 6.98568207680949848637e-01, /* 0x3fe65aabb6c07b02 */
+ 7.00854407884450081312e-01, /* 0x3fe66d663923e086 */
+ 7.03131821924453670469e-01, /* 0x3fe6800e4e7e2857 */
+ 7.05400476865049030906e-01, /* 0x3fe692a40556fb6a */
+ 7.07660399923197958039e-01, /* 0x3fe6a5276c4b0575 */
+ 7.09911618463524796141e-01, /* 0x3fe6b798920b3d98 */
+ 7.12154159993178659249e-01, /* 0x3fe6c9f7855c3198 */
+ 7.14388052156768926793e-01, /* 0x3fe6dc44551553ae */
+ 7.16613322731374569052e-01, /* 0x3fe6ee7f10204aef */
+ 7.18829999621624415873e-01, /* 0x3fe700a7c5784633 */
+ 7.21038110854851588272e-01, /* 0x3fe712be84295198 */
+ 7.23237684576317874097e-01, /* 0x3fe724c35b4fae7b */
+ 7.25428749044510712274e-01, /* 0x3fe736b65a172dff */
+ 7.27611332626510676214e-01, /* 0x3fe748978fba8e0f */
+ 7.29785463793429123314e-01, /* 0x3fe75a670b82d8d8 */
+ 7.31951171115916565668e-01, /* 0x3fe76c24dcc6c6c0 */
+ 7.34108483259739652560e-01, /* 0x3fe77dd112ea22c7 */
+ 7.36257428981428097003e-01, /* 0x3fe78f6bbd5d315e */
+ 7.38398037123989547936e-01, /* 0x3fe7a0f4eb9c19a2 */
+ 7.40530336612692630105e-01, /* 0x3fe7b26cad2e50fd */
+ 7.42654356450917929600e-01, /* 0x3fe7c3d311a6092b */
+ 7.44770125716075148681e-01, /* 0x3fe7d528289fa093 */
+ 7.46877673555587429099e-01, /* 0x3fe7e66c01c114fd */
+ 7.48977029182941400620e-01, /* 0x3fe7f79eacb97898 */
+ 7.51068221873802288613e-01, /* 0x3fe808c03940694a */
+ 7.53151280962194302759e-01, /* 0x3fe819d0b7158a4c */
+ 7.55226235836744863583e-01, /* 0x3fe82ad036000005 */
+ 7.57293115936992444759e-01, /* 0x3fe83bbec5cdee22 */
+ 7.59351950749757920178e-01, /* 0x3fe84c9c7653f7ea */
+ 7.61402769805578416573e-01, /* 0x3fe85d69576cc2c5 */
+ 7.63445602675201784315e-01, /* 0x3fe86e2578f87ae5 */
+ 7.65480478966144461950e-01, /* 0x3fe87ed0eadc5a2a */
+ 7.67507428319308182552e-01, /* 0x3fe88f6bbd023118 */
+ 7.69526480405658186434e-01, /* 0x3fe89ff5ff57f1f7 */
+ 7.71537664922959498526e-01, /* 0x3fe8b06fc1cf3dfe */
+ 7.73541011592573490852e-01, /* 0x3fe8c0d9145cf49d */
+ 7.75536550156311621507e-01, /* 0x3fe8d13206f8c4ca */
+ 7.77524310373347682379e-01, /* 0x3fe8e17aa99cc05d */
+ 7.79504322017186335181e-01, /* 0x3fe8f1b30c44f167 */
+ 7.81476614872688268854e-01, /* 0x3fe901db3eeef187 */
+ 7.83441218733151756304e-01, /* 0x3fe911f35199833b */
+ 7.85398163397448278999e-01}; /* 0x3fe921fb54442d18 */
+
+ /* Some constants. */
+
+ static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */
+ piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ three_piby4 = 2.3561944901923449e+00; /* 0x4002d97c7f3321d2 */
+
+ double u, v, vbyu, q, s, uu, r;
+ unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf;
+ int xexp, yexp, diffexp;
+
+ double x = fx;
+ double y = fy;
+
+ /* Find properties of arguments x and y. */
+
+ unsigned long long ux, aux, xneg, uy, auy, yneg;
+
+ GET_BITS_DP64(x, ux);
+ GET_BITS_DP64(y, uy);
+ aux = ux & ~SIGNBIT_DP64;
+ auy = uy & ~SIGNBIT_DP64;
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ xneg = ux & SIGNBIT_DP64;
+ yneg = uy & SIGNBIT_DP64;
+ xzero = (aux == 0);
+ yzero = (auy == 0);
+ xnan = (aux > PINFBITPATT_DP64);
+ ynan = (auy > PINFBITPATT_DP64);
+ xinf = (aux == PINFBITPATT_DP64);
+ yinf = (auy == PINFBITPATT_DP64);
+
+ diffexp = yexp - xexp;
+
+ /* Special cases */
+
+ if (xnan)
+#ifdef WINDOWS
+ {
+ unsigned int ufx;
+ GET_BITS_SP32(fx, ufx);
+ return handle_errorf("atan2f", ufx|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+ }
+#else
+ return fx + fx; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (ynan)
+#ifdef WINDOWS
+ {
+ unsigned int ufy;
+ GET_BITS_SP32(fy, ufy);
+ return handle_errorf("atan2f", ufy|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+ }
+#else
+ return (float)(y + y); /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (yzero)
+ { /* Zero y gives +-0 for positive x
+ and +-pi for negative x */
+#ifndef WINDOWS
+ if ((_LIB_VERSION == _SVID_) && xzero)
+ /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+ return retval_errno_edom(x, y);
+ else
+#endif
+ if (xneg)
+ {
+ if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+ else return valf_with_flags((float)pi, AMD_F_INEXACT);
+ }
+ else return (float)y;
+ }
+ else if (xzero)
+ { /* Zero x gives +- pi/2
+ depending on sign of y */
+ if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+      else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+
+ if (diffexp > 26)
+ { /* abs(y)/abs(x) > 2^26 => arctan(x/y)
+ is insignificant compared to piby2 */
+ if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+ else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+ else if (diffexp < -13 && (!xneg))
+ { /* x positive and dominant over y by a factor of 2^13.
+ In this case atan(y/x) is y/x to machine accuracy. */
+
+ if (diffexp < -150) /* Result underflows */
+ {
+ if (yneg)
+ return valf_with_flags(-0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return valf_with_flags(0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ }
+ else
+ {
+ if (diffexp < -126)
+ {
+ /* Result will likely be denormalized */
+ y = scaleDouble_1(y, 100);
+ y /= x;
+ /* Now y is 2^100 times the true result. Scale it back down. */
+ GET_BITS_DP64(y, uy);
+ scaleDownDouble(uy, 100, &uy);
+ PUT_BITS_DP64(uy, y);
+ if ((uy & EXPBITS_DP64) == 0)
+ return valf_with_flags((float)y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return (float)y;
+ }
+ else
+ return (float)(y / x);
+ }
+ }
+ else if (diffexp < -26 && xneg)
+    { /* abs(x)/abs(y) > 2^26 and x < 0 => arctan(y/x)
+                 is insignificant compared to pi */
+ if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+ else return valf_with_flags((float)pi, AMD_F_INEXACT);
+ }
+ else if (yinf && xinf)
+ { /* If abs(x) and abs(y) are both infinity
+ return +-pi/4 or +- 3pi/4 according to
+ signs. */
+ if (xneg)
+ {
+ if (yneg) return valf_with_flags((float)-three_piby4, AMD_F_INEXACT);
+ else return valf_with_flags((float)three_piby4, AMD_F_INEXACT);
+ }
+ else
+ {
+ if (yneg) return valf_with_flags((float)-piby4, AMD_F_INEXACT);
+ else return valf_with_flags((float)piby4, AMD_F_INEXACT);
+ }
+ }
+
+ /* General case: take absolute values of arguments */
+
+ u = x; v = y;
+ if (xneg) u = -x;
+ if (yneg) v = -y;
+
+ /* Swap u and v if necessary to obtain 0 < v < u. Compute v/u. */
+
+ swap_vu = (u < v);
+ if (swap_vu) { uu = u; u = v; v = uu; }
+ vbyu = v/u;
+
+ if (vbyu > 0.0625)
+ { /* General values of v/u. Use a look-up
+ table and series expansion. */
+
+ index = (int)(256*vbyu + 0.5);
+ r = (256*v-index*u)/(256*u+index*v);
+
+ /* Polynomial approximation to atan(vbyu) */
+
+ s = r*r;
+ q = atan_jby256[index-16] + r - r*s*0.33333333333224095522;
+ }
+ else if (vbyu < 1.e-4)
+ { /* v/u is small enough that atan(v/u) = v/u */
+ q = vbyu;
+ }
+ else /* vbyu <= 0.0625 */
+ {
+ /* Small values of v/u. Use a series expansion */
+
+ s = vbyu*vbyu;
+ q = vbyu -
+ vbyu*s*(0.33333333333333170500 -
+ s*(0.19999999999393223405 -
+ s*0.14285713561807169030));
+ }
+
+ /* Tidy-up according to which quadrant the arguments lie in */
+
+ if (swap_vu) {q = piby2 - q;}
+ if (xneg) {q = pi - q;}
+ if (yneg) q = - q;
+ return (float)q;
+}
+
+weak_alias (__atan2f, atan2f)
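
The single-precision version folds the table step c = index/256 directly into the residual: scaling numerator and denominator by 256 turns (v - c*u)/(u + c*v) into (256*v - index*u)/(256*u + index*v), and one unsplit table entry plus a single correction term is enough for float accuracy. A small sketch, not part of the patch, checking that the two residual forms agree at an arbitrary point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 2.0, v = 0.55;                 /* v/u > 0.0625 */
        int index = (int)(256.0 * (v / u) + 0.5);
        double c = index / 256.0;
        double r1 = (v - c * u) / (u + c * v);                          /* atan2  */
        double r2 = (256.0 * v - index * u) / (256.0 * u + index * v); /* atan2f */
        printf("%.17g %.17g\n", r1, r2);          /* agree to rounding */
        return 0;
    }
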
diff --git a/src/atanf.c b/src/atanf.c
new file mode 100644
index 0000000..567dd87
--- /dev/null
+++ b/src/atanf.c
@@ -0,0 +1,170 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (float)x;
+ exc.arg2 = (float)x;
+ exc.name = (char *)"atanf";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atanf)
+#endif
+
+float FN_PROTOTYPE(atanf)(float fx)
+{
+
+  /* Some constants. */
+
+ static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */
+
+ double c, v, s, q, z;
+ unsigned int xnan;
+
+ double x = fx;
+
+ /* Find properties of argument fx. */
+
+ unsigned long long ux, aux, xneg;
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = ux & SIGNBIT_DP64;
+
+ v = x;
+ if (xneg) v = -x;
+
+ /* Argument reduction to range [-7/16,7/16] */
+
+ if (aux < 0x3ec0000000000000) /* v < 2.0^(-19) */
+ {
+ /* x is a good approximation to atan(x) */
+ if (aux == 0x0000000000000000)
+ return fx;
+ else
+ return valf_with_flags(fx, AMD_F_INEXACT);
+ }
+ else if (aux < 0x3fdc000000000000) /* v < 7./16. */
+ {
+ x = v;
+ c = 0.0;
+ }
+ else if (aux < 0x3fe6000000000000) /* v < 11./16. */
+ {
+ x = (2.0*v-1.0)/(2.0+v);
+ /* c = arctan(0.5) */
+ c = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */
+ }
+ else if (aux < 0x3ff3000000000000) /* v < 19./16. */
+ {
+ x = (v-1.0)/(1.0+v);
+ /* c = arctan(1.) */
+ c = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */
+ }
+ else if (aux < 0x4003800000000000) /* v < 39./16. */
+ {
+ x = (v-1.5)/(1.0+1.5*v);
+ /* c = arctan(1.5) */
+ c = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */
+ }
+ else
+ {
+
+ xnan = (aux > PINFBITPATT_DP64);
+
+ if (xnan)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ unsigned int uhx;
+ GET_BITS_SP32(fx, uhx);
+ return handle_errorf("atanf", uhx|0x00400000, _DOMAIN,
+ 0, EDOM, fx, 0.0F);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ }
+ else if (aux > 0x4190000000000000)
+ { /* abs(x) > 2^26 => arctan(1/x) is
+ insignificant compared to piby2 */
+ if (xneg)
+ return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+ else
+ return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+
+ x = -1.0/v;
+ /* c = arctan(infinity) */
+ c = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ }
+
+ /* Core approximation: Remez(2,2) on [-7/16,7/16] */
+
+ s = x*x;
+ q = x*s*
+ (0.296528598819239217902158651186e0 +
+ (0.192324546402108583211697690500e0 +
+ 0.470677934286149214138357545549e-2*s)*s)/
+ (0.889585796862432286486651434570e0 +
+ (0.111072499995399550138837673349e1 +
+ 0.299309699959659728404442796915e0*s)*s);
+
+ z = c - (q - x);
+
+ if (xneg) z = -z;
+ return (float)z;
+}
+
+weak_alias (__atanf, atanf)
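
Each reduction branch in atanf applies atan(v) = atan(k) + atan((v - k)/(1 + k*v)) for a fixed pivot k (0.5, 1.0, 1.5, or the reciprocal case for large v), so the Remez(2,2) rational approximation only ever sees arguments of magnitude below roughly 7/16. A minimal sketch of one branch, not part of the patch, using atan from <math.h> as the reference and an arbitrary test value:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double v = 0.9, k = 1.0;              /* the 11/16 <= v < 19/16 branch */
        double x = (v - k) / (1.0 + k * v);   /* reduced argument, |x| < 7/16  */
        double recombined = atan(k) + atan(x);
        printf("%.17g %.17g\n", recombined, atan(v));
        return 0;
    }
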
diff --git a/src/atanh.c b/src/atanh.c
new file mode 100644
index 0000000..5815ced
--- /dev/null
+++ b/src/atanh.c
@@ -0,0 +1,193 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x, double retval)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atanh";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = retval;
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanh: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "atanh"
+double FN_PROTOTYPE(atanh)(double x)
+{
+
+ unsigned long long ux, ax;
+ double r, absx, t, poly;
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+ PUT_BITS_DP64(ax, absx);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity; return a NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID));
+#endif
+ }
+ }
+ else if (ax >= 0x3ff0000000000000)
+ {
+ if (ax > 0x3ff0000000000000)
+ {
+ /* abs(x) > 1.0; return NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID));
+#endif
+ }
+ else if (ux == 0x3ff0000000000000)
+ {
+ /* x = +1.0; return infinity with the same sign as x
+ and set the divbyzero status flag */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,infinity_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ else
+ {
+ /* x = -1.0; return infinity with the same sign as x */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,-infinity_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ }
+
+
+ if (ax < 0x3e30000000000000)
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Arguments smaller than 2^(-28) in magnitude are
+ approximated by atanh(x) = x, raising inexact flag. */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ }
+ else
+ {
+ if (ax < 0x3fe0000000000000)
+ {
+ /* Arguments up to 0.5 in magnitude are
+ approximated by a [5,5] minimax polynomial */
+ t = x*x;
+ poly =
+ (0.47482573589747356373e0 +
+ (-0.11028356797846341457e1 +
+ (0.88468142536501647470e0 +
+ (-0.28180210961780814148e0 +
+ (0.28728638600548514553e-1 -
+ 0.10468158892753136958e-3 * t) * t) * t) * t) * t) /
+ (0.14244772076924206909e1 +
+ (-0.41631933639693546274e1 +
+ (0.45414700626084508355e1 +
+ (-0.22608883748988489342e1 +
+ (0.49561196555503101989e0 -
+ 0.35861554370169537512e-1 * t) * t) * t) * t) * t);
+ return x + x*t*poly;
+ }
+ else
+ {
+ /* abs(x) >= 0.5 */
+ /* Note that
+ atanh(x) = 0.5 * ln((1+x)/(1-x))
+ (see Abramowitz and Stegun 4.6.22).
+ For greater accuracy we use the variant formula
+ atanh(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)).
+ */
+ r = (2.0 * absx) / (1.0 - absx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ if (ux & SIGNBIT_DP64)
+ /* Argument x is negative */
+ return -r;
+ else
+ return r;
+ }
+ }
+}
+
+weak_alias (__atanh, atanh)
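
For |x| >= 0.5 the code evaluates atanh through the rewrite 0.5*ln((1+x)/(1-x)) = 0.5*log1p(2x/(1-x)) noted in the comment above. A small check of that identity, not part of the patch, assuming the C99 log1p and atanh from <math.h> and an arbitrary test point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.75;
        double direct  = 0.5 * log((1.0 + x) / (1.0 - x));
        double variant = 0.5 * log1p(2.0 * x / (1.0 - x));   /* form used above */
        printf("%.17g %.17g %.17g\n", direct, variant, atanh(x));
        return 0;
    }
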
diff --git a/src/atanhf.c b/src/atanhf.c
new file mode 100644
index 0000000..38692b4
--- /dev/null
+++ b/src/atanhf.c
@@ -0,0 +1,194 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_NANF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x, float retval)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atanhf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = (double)retval;
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanhf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "atanhf"
+float FN_PROTOTYPE(atanhf)(float x)
+{
+
+ double dx;
+ unsigned int ux, ax;
+ double r, t, poly;
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity; return a NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID));
+#endif
+ }
+ }
+ else if (ax >= 0x3f800000)
+ {
+ if (ax > 0x3f800000)
+ {
+ /* abs(x) > 1.0; return NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID));
+#endif
+ }
+ else if (ux == 0x3f800000)
+ {
+ /* x = +1.0; return infinity with the same sign as x
+ and set the divbyzero status flag */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,infinityf_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ else
+ {
+ /* x = -1.0; return infinity with the same sign as x */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,-infinityf_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ }
+
+ if (ax < 0x39000000)
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Arguments smaller than 2^(-13) in magnitude are
+ approximated by atanhf(x) = x, raising inexact flag. */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ }
+ else
+ {
+ dx = x;
+ if (ax < 0x3f000000)
+ {
+ /* Arguments up to 0.5 in magnitude are
+ approximated by a [2,2] minimax polynomial */
+ t = dx*dx;
+ poly =
+ (0.39453629046e0 +
+ (-0.28120347286e0 +
+ 0.92834212715e-2 * t) * t) /
+ (0.11836088638e1 +
+ (-0.15537744551e1 +
+ 0.45281890445e0 * t) * t);
+ return (float)(dx + dx*t*poly);
+ }
+ else
+ {
+ /* abs(x) >= 0.5 */
+ /* Note that
+ atanhf(x) = 0.5 * ln((1+x)/(1-x))
+ (see Abramowitz and Stegun 4.6.22).
+ For greater accuracy we use the variant formula
+ atanhf(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)).
+ */
+ if (ux & SIGNBIT_SP32)
+ {
+ /* Argument x is negative */
+ r = (-2.0 * dx) / (1.0 + dx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ return (float)-r;
+ }
+ else
+ {
+ r = (2.0 * dx) / (1.0 - dx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ return (float)r;
+ }
+ }
+ }
+}
+
+weak_alias (__atanhf, atanhf)
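
Below 0.5 in magnitude atanhf evaluates the odd series in the form x + x*t*P(t)/Q(t) with t = x*x, i.e. a [2,2] rational correction applied on top of the leading term x. A standalone sketch, not part of the patch, that re-evaluates that rational form in double precision (coefficients copied from the function above) against atanh from <math.h>; the test point is arbitrary:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.25, t = x * x;
        double poly = (0.39453629046e0 +
                      (-0.28120347286e0 + 0.92834212715e-2 * t) * t) /
                      (0.11836088638e1 +
                      (-0.15537744551e1 + 0.45281890445e0 * t) * t);
        printf("%.9g %.9g\n", x + x * t * poly, atanh(x));  /* agree to float accuracy */
        return 0;
    }
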
diff --git a/src/ceil.c b/src/ceil.c
new file mode 100644
index 0000000..94ef21d
--- /dev/null
+++ b/src/ceil.c
@@ -0,0 +1,104 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceil)
+#endif
+
+double FN_PROTOTYPE(ceil)(double x)
+{
+ double r;
+ long long rexp, xneg;
+ unsigned long long ux, ax, ur, mask;
+
+ GET_BITS_DP64(x, ux);
+ /*ax is |x|*/
+ ax = ux & (~SIGNBIT_DP64);
+ /*xneg stores the sign of the input x*/
+ xneg = (ux != ax);
+  /* The range is divided into:
+     |x| >= 2^53:       the value is already integral (or NaN/infinity);
+                        a NaN input returns a QNaN, raising an exception
+                        if the input is a SNaN.
+     |x| < 1.0:         +/-0.0 is returned unchanged; if -1.0 < x < -0.0
+                        return -0.0; if 0.0 < x < 1.0 return 1.0.
+     1.0 <= |x| < 2^53: use the exponent to mask off the fractional
+                        mantissa bits and adjust the result.
+  */
+  if (ax >= 0x4340000000000000)  /* abs(x) >= 2^53 */
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("ceil", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x0000000000000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0; return -0.0 */
+ {
+ PUT_BITS_DP64(0x8000000000000000, r);
+ return r;
+ }
+ else
+ return 1.0;
+ }
+ else
+ {
+      /* Get the exponent for the floating point number. Should be between 0 and 52. */
+ rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of r that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1;
+ /*Keeps the exponent part and the required mantissa.*/
+ ur = (ux & ~mask);
+ PUT_BITS_DP64(ur, r);
+ if (xneg || (ur == ux))
+ return r;
+ else
+ /* We threw some bits away and x was positive */
+ return r + 1.0;
+ }
+
+}
+
+weak_alias (__ceil, ceil)
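
For 1.0 <= |x| < 2^53 the rounding is done entirely on the bit pattern: the unbiased exponent says how many low mantissa bits hold the fraction, clearing them truncates toward zero, and positive inputs that lost bits are bumped up by 1.0. A minimal sketch of the same trick, not part of the patch, using memcpy in place of the GET_BITS_DP64/PUT_BITS_DP64 macros and handling only a positive example:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 5.3;
        uint64_t ux;
        memcpy(&ux, &x, sizeof ux);                    /* reinterpret the bits   */
        int rexp = (int)((ux >> 52) & 0x7ff) - 1023;   /* unbiased exponent      */
        uint64_t mask = (((uint64_t)1) << (52 - rexp)) - 1;
        uint64_t ur = ux & ~mask;                      /* truncate toward zero   */
        double r;
        memcpy(&r, &ur, sizeof r);
        if (ur != ux)                                  /* bits were discarded... */
            r += 1.0;                                  /* ...so round up         */
        printf("%g\n", r);                             /* prints 6               */
        return 0;
    }
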
diff --git a/src/ceilf.c b/src/ceilf.c
new file mode 100644
index 0000000..56d0c37
--- /dev/null
+++ b/src/ceilf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceilf)
+#endif
+
+float FN_PROTOTYPE(ceilf)(float x)
+{
+ float r;
+ int rexp, xneg;
+ unsigned int ux, ax, ur, mask;
+
+ GET_BITS_SP32(x, ux);
+ /*ax is |x|*/
+ ax = ux & (~SIGNBIT_SP32);
+ /*xneg stores the sign of the input x*/
+ xneg = (ux != ax);
+  /* The range is divided into:
+     |x| >= 2^24:       the value is already integral (or NaN/infinity);
+                        a NaN input returns a QNaN, raising an exception
+                        if the input is a SNaN.
+     |x| < 1.0:         +/-0.0 is returned unchanged; if -1.0 < x < -0.0
+                        return -0.0; if 0.0 < x < 1.0 return 1.0.
+     1.0 <= |x| < 2^24: use the exponent to mask off the fractional
+                        mantissa bits and adjust the result.
+  */
+  if (ax >= 0x4b800000)  /* abs(x) >= 2^24 */
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^24 */
+ if (ax > 0x7f800000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf("ceilf", ux, _DOMAIN, 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3f800000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x00000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -0.0F;
+ else
+ return 1.0F;
+ }
+ else
+ {
+ rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ /* Mask out the bits of r that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1;
+ /*Keeps the exponent part and the required mantissa.*/
+ ur = (ux & ~mask);
+ PUT_BITS_SP32(ur, r);
+
+ if (xneg || (ux == ur)) return r;
+ else
+ /* We threw some bits away and x was positive */
+ return r + 1.0F;
+ }
+}
+
+weak_alias (__ceilf, ceilf)
diff --git a/src/cosh.c b/src/cosh.c
new file mode 100644
index 0000000..6f8734b
--- /dev/null
+++ b/src/cosh.c
@@ -0,0 +1,359 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"cosh";
+ if (_LIB_VERSION == _SVID_)
+ {
+ exc.retval = HUGE;
+ }
+ else
+ {
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+double FN_PROTOTYPE(cosh)(double x)
+{
+ /*
+ Derived from sinh subroutine
+
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+    abs(x) >= max_cosh_arg:
+      cosh(x) = +Inf, with the overflow flag raised
+
+    abs(x) >= small_threshold:
+      cosh(x) = exp(abs(x))/2 computed using the
+      splitexp and scaleDouble functions as for exp_amd().
+
+    abs(x) < small_threshold:
+      let y = abs(x), y0 = (int)y and dy = y - y0; then
+      cosh(x) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+      using the tabulated values of sinh(y0) and cosh(y0) below. */
+
+ static const double
+ max_cosh_arg = 7.10475860073943977113e+02, /* 0x408633ce8fb9f87e */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+// small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+ small_threshold = 20.0;
+      /* (8*BASEDIGITS_DP64*log10of2): exp(-x) insignificant compared to exp(x) */
+
+ /* Lead and tail tabulated values of sinh(i) and cosh(i)
+ for i = 0,...,36. The lead part has 26 leading bits. */
+
+ static const double sinh_lead[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */
+ 3.62686038017272949219e+00, /* 0x400d03cf60000000 */
+ 1.00178747177124023438e+01, /* 0x40240926e0000000 */
+ 2.72899169921875000000e+01, /* 0x403b4a3800000000 */
+ 7.42032089233398437500e+01, /* 0x40528d0160000000 */
+ 2.01713153839111328125e+02, /* 0x406936d228000000 */
+ 5.48316116333007812500e+02, /* 0x4081228768000000 */
+ 1.49047882080078125000e+03, /* 0x409749ea50000000 */
+ 4.05154187011718750000e+03, /* 0x40afa71570000000 */
+ 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double sinh_tail[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */
+ 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */
+ 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */
+ 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */
+ 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */
+ 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */
+ 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */
+ 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */
+ 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */
+ 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */
+ 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */
+ 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */
+ 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */
+ 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */
+ 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */
+ 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */
+ 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */
+ 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */
+ 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */
+ 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */
+ 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */
+ 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */
+ 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */
+ 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */
+ 2.60692936262073658327e+02, /* 0x40704b1644557d1a */
+ 3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */
+ 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ static const double cosh_lead[ 37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */
+ 3.76219564676284790039e+00, /* 0x400e18fa08000000 */
+ 1.00676617622375488281e+01, /* 0x402422a490000000 */
+ 2.73082327842712402344e+01, /* 0x403b4ee858000000 */
+ 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */
+ 2.01715633392333984375e+02, /* 0x406936e678000000 */
+ 5.48317031860351562500e+02, /* 0x4081228948000000 */
+ 1.49047915649414062500e+03, /* 0x409749eaa8000000 */
+ 4.05154199218750000000e+03, /* 0x40afa71580000000 */
+ 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double cosh_tail[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */
+ 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */
+ 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */
+ 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */
+ 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */
+ 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */
+ 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */
+ 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */
+ 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */
+ 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */
+ 6.51685096227860253398e-05, /* 0x3f11156278615e10 */
+ 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */
+ 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */
+ 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */
+ 2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */
+ 1.02539925859688602072e-02, /* 0x3f85000b967b3698 */
+ 1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */
+ 6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */
+ 4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */
+ 1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */
+ 1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */
+ 7.06579578098005001152e+00, /* 0x401c435ff81e18ac */
+ 5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */
+ 1.68921736147088438429e+02, /* 0x40651d7edccde926 */
+ 2.60692936262087528121e+02, /* 0x40704b1644557e0e */
+ 3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */
+ 4.07689930834187453002e+03, /* 0x40afd9cc72249abe */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that cosh(x) = 1 */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return 1.0;
+ else
+ return val_with_flags(1.0, AMD_F_INEXACT);
+ }
+ else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+ {
+ if (aux > PINFBITPATT_DP64) /* |x| is a NaN? */
+ return x + x;
+ else /* x is infinity */
+ return infinity_with_flags(0);
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_cosh_arg)
+ {
+ /* Return +/-infinity with overflow flag */
+#ifdef WINDOWS
+ return handle_error("cosh", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW, EDOM, x, 0.0F);
+#else
+ return retval_errno_erange(x);
+#endif
+
+
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so cosh(y) is approximated by exp(y)/2 (cosh is even). The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ if (m >= EMIN_DP64 && m <= EMAX_DP64)
+ z = scaleDouble_1((z1+z2),m);
+ else
+ z = scaleDouble_2((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+ sdy = dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ /* At this point sinh(dy) is approximated by dy + sdy, and cosh(dy) is approximated by 1 + cdy.
+ Accumulate the terms below from smallest to largest to limit rounding error. */
+ z = ((((((cosh_tail[ind]*cdy + sinh_tail[ind]*sdy)
+ + sinh_tail[ind]*dy) + cosh_tail[ind])
+ + cosh_lead[ind]*cdy) + sinh_lead[ind]*sdy)
+ + sinh_lead[ind]*dy) + cosh_lead[ind];
+ }
+
+ return z;
+}
+
+weak_alias (__cosh, cosh)
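+
+/* Editorial illustration (not part of the original sources): the table-driven
+ branch above relies on the addition formula
+ cosh(y0 + dy) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+ with y0 = (int)y and 0 <= dy < 1. A plain-libm sketch of that decomposition;
+ cosh_by_tables_sketch is a hypothetical helper name:
+
+ #include <math.h>
+ static double cosh_by_tables_sketch(double y) // assumes 0 <= y < 37
+ {
+ int y0 = (int)y; // index into the 37-entry tables
+ double dy = y - y0; // 0 <= dy < 1
+ return cosh(y0) * cosh(dy) + sinh(y0) * sinh(dy);
+ }
+
+ In the real code cosh(y0)/sinh(y0) come from the lead/tail tables and
+ cosh(dy)/sinh(dy) from the short polynomials, summed smallest-first. */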
diff --git a/src/coshf.c b/src/coshf.c
new file mode 100644
index 0000000..ab2b68e
--- /dev/null
+++ b/src/coshf.c
@@ -0,0 +1,268 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"coshf";
+ if (_LIB_VERSION == _SVID_)
+ {
+ exc.retval = HUGE;
+ }
+ else
+ {
+ exc.retval = infinityf_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+#endif
+float FN_PROTOTYPE(coshf)(float fx)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_cosh_arg:
+ cosh(x) = +Inf
+
+ abs(x) >= small_threshold:
+ cosh(x) = exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ let y0 = (int)abs(x) and dy = abs(x) - y0; then
+ cosh(x) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+ with cosh(y0) and sinh(y0) taken from the tables below and
+ cosh(dy), sinh(dy) from short polynomials. */
+
+ static const double
+ /* The max argument of coshf, but stored as a double */
+ max_cosh_arg = 8.94159862922329438106e+01, /* 0x40565a9f84f82e63 */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+// small_threshold = 20.0;
+ /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is insignificant compared with exp(x) */
+
+ /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36. */
+
+ static const double sinh_lead[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */
+ 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */
+ 1.00178749274099008204e+01, /* 0x40240926e70949ad */
+ 2.72899171971277496596e+01, /* 0x403b4a3803703630 */
+ 7.42032105777887522891e+01, /* 0x40528d0166f07374 */
+ 2.01713157370279219549e+02, /* 0x406936d22f67c805 */
+ 5.48316123273246489589e+02, /* 0x408122876ba380c9 */
+ 1.49047882578955000099e+03, /* 0x409749ea514eca65 */
+ 4.05154190208278987484e+03, /* 0x40afa7157430966f */
+ 1.10132328747033916443e+04, /* 0x40c5829dced69991 */
+ 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */
+ 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */
+ 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */
+ 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */
+ 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */
+ 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */
+ 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */
+ 3.28299845686652474105e+07, /* 0x417f4f22091940bb */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ static const double cosh_lead[ 37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */
+ 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */
+ 1.00676619957777653269e+01, /* 0x402422a497d6185e */
+ 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */
+ 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */
+ 2.01715636122455890700e+02, /* 0x406936e67db9b919 */
+ 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */
+ 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */
+ 4.05154202549259389343e+03, /* 0x40afa715845d8894 */
+ 1.10132329201033226127e+04, /* 0x40c5829dd053712d */
+ 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */
+ 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */
+ 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */
+ 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */
+ 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */
+ 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */
+ 1.20774763767876680940e+07, /* 0x416709348c0ea503 */
+ 3.28299845686652623117e+07, /* 0x417f4f22091940bf */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ unsigned long long ux, aux, xneg;
+ double x = fx, y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3f10000000000000) /* |x| small enough that cosh(x) = 1 */
+ {
+ if (aux == 0) return (float)1.0; /* with no inexact */
+ if (LAMBDA_DP64 + x > 1.0) return valf_with_flags((float)1.0, AMD_F_INEXACT); /* with inexact */
+ }
+ else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+ {
+ if (aux > PINFBITPATT_DP64) /* |x| is a NaN? */
+ return fx + fx;
+ else /* x is infinity */
+ return infinityf_with_flags(0);
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_cosh_arg)
+ {
+ /* Return infinity with overflow flag. */
+ /* This handles POSIX behaviour */
+ __set_errno(ERANGE);
+ z = infinityf_with_flags(AMD_F_OVERFLOW);
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so cosh(y) is approximated by exp(y)/2 (cosh is even). The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ /* scaleDouble_1 is always safe because the argument x was
+ float, rather than double */
+
+ z = scaleDouble_1((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+
+ sdy = dy + dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = 1 + dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ z = cosh_lead[ind]*cdy + sinh_lead[ind]*sdy;
+ }
+
+// if (xneg) z = - z;
+ return (float)z;
+}
+
+weak_alias (__coshf, coshf)
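+
+/* Editorial illustration (not from the original sources): for y >= small_threshold
+ the negative exponential is negligible and the branch above effectively computes
+ exp(y)/2, folding the halving into the exponent scaling (m -= 1). A plain-libm
+ sketch with a hypothetical helper name:
+
+ #include <math.h>
+ static float coshf_large_sketch(float x) // assumes |x| >= small_threshold
+ {
+ double y = fabs((double)x);
+ return (float)ldexp(exp(y), -1); // exp(y)/2, i.e. the exponent reduced by 1
+ }
+*/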
diff --git a/src/exp_special.c b/src/exp_special.c
new file mode 100644
index 0000000..ca32ec2
--- /dev/null
+++ b/src/exp_special.c
@@ -0,0 +1,110 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// y = expf(x)
+// y = exp(x)
+
+// these codes and the ones in the related .S or .asm files have to match
+#define EXP_X_NAN 1
+#define EXP_Y_ZERO 2
+#define EXP_Y_INF 3
+
+float _expf_special(float x, float y, U32 code)
+{
+ switch(code)
+ {
+ case EXP_X_NAN:
+ {
+#ifdef WIN64
+ // y is assumed to be qnan, only check x for snan
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "expf", x, is_x_snan, 0.0f, 0, y, 0);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case EXP_Y_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case EXP_Y_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_errorf(OVERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+ }
+
+
+ return y;
+}
+
+double _exp_special(double x, double y, U32 code)
+{
+ switch(code)
+ {
+ case EXP_X_NAN:
+ {
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "exp", x, 0.0, y);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case EXP_Y_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_error(UNDERFLOW, ERANGE, "exp", x, 0.0, y);
+ }
+ break;
+
+ case EXP_Y_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_error(OVERFLOW, ERANGE, "exp", x, 0.0, y);
+ }
+ break;
+ }
+
+
+ return y;
+}
+
+#endif /* __x86_64__ */
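+
+/* Editorial illustration (not from the original sources): the fast exp paths in
+ the related .S files hand the already-computed result y plus one of the codes
+ above to these handlers. A C-level sketch of such a caller; EXP_MAX_ARG and
+ EXP_MIN_ARG are hypothetical threshold names, the real checks live in assembly:
+
+ if (x != x) return _exp_special(x, x + x, EXP_X_NAN); // NaN input
+ if (x > EXP_MAX_ARG) return _exp_special(x, HUGE_VAL, EXP_Y_INF); // overflow
+ if (x < EXP_MIN_ARG) return _exp_special(x, 0.0, EXP_Y_ZERO); // underflow
+*/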
diff --git a/src/finite.c b/src/finite.c
new file mode 100644
index 0000000..7e7ca39
--- /dev/null
+++ b/src/finite.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+#ifdef WINDOWS
+int FN_PROTOTYPE(finite)(double x)
+#else
+int FN_PROTOTYPE(finite)(double x)
+#endif
+{
+
+#ifdef WINDOWS
+
+ unsigned long long ux;
+ GET_BITS_DP64(x, ux);
+ return (int)(((ux & ~SIGNBIT_DP64) - PINFBITPATT_DP64) >> 63);
+
+#else
+
+ /* This works on Hammer with gcc */
+ unsigned long ux =0x7ff0000000000000 ;
+ double temp;
+ PUT_BITS_DP64(ux, temp);
+
+ // double temp = 1.0e444; /* = infinity = 0x7ff0000000000000 */
+ volatile int retval;
+ retval = 0;
+ asm volatile ("andpd %0, %1;" : : "x" (temp), "x" (x));
+ asm volatile ("comisd %0, %1" : : "x" (temp), "x" (x));
+ asm volatile ("setnz %0" : "=g" (retval));
+ return retval;
+
+#endif
+}
+
+weak_alias (__finite, finite)
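+
+/* Editorial illustration (not from the original sources): the WINDOWS branch
+ relies on unsigned wrap-around: for finite x the absolute bit pattern is below
+ PINFBITPATT_DP64, so the subtraction wraps and sets bit 63; for Inf/NaN it does
+ not. A portable C sketch of the same test, with a hypothetical helper name:
+
+ #include <stdint.h>
+ #include <string.h>
+ static int finite_bits_sketch(double x)
+ {
+ uint64_t ux;
+ memcpy(&ux, &x, sizeof ux); // raw bit pattern of x
+ return (int)(((ux & ~(1ULL << 63)) - 0x7ff0000000000000ULL) >> 63);
+ }
+*/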
diff --git a/src/finitef.c b/src/finitef.c
new file mode 100644
index 0000000..8c0613a
--- /dev/null
+++ b/src/finitef.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+#ifdef WINDOWS
+int FN_PROTOTYPE(finitef)(float x)
+#else
+int FN_PROTOTYPE(finitef)(float x)
+#endif
+{
+
+#ifdef WINDOWS
+
+ unsigned int ux;
+ GET_BITS_SP32(x, ux);
+ return (int)(((ux & ~SIGNBIT_SP32) - PINFBITPATT_SP32) >> 31);
+
+#else
+
+ /* This works on Hammer */
+ unsigned int ux=0x7f800000;
+ float temp;
+ PUT_BITS_SP32(ux, temp);
+
+ /* float temp = 1.0e444; *//* = infinity = 0x7f800000 */
+ volatile int retval;
+ retval = 0;
+ asm volatile ("andps %0, %1;" : : "x" (temp), "x" (x));
+ asm volatile ("comiss %0, %1" : : "x" (temp), "x" (x));
+ asm volatile ("setnz %0" : "=g" (retval));
+ return retval;
+
+#endif
+}
+
+weak_alias (__finitef, finitef)
diff --git a/src/floor.c b/src/floor.c
new file mode 100644
index 0000000..a1b99c5
--- /dev/null
+++ b/src/floor.c
@@ -0,0 +1,92 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#endif
+
+#ifdef WINDOWS
+#pragma function(floor)
+#endif
+
+double FN_PROTOTYPE(floor)(double x)
+{
+ double r;
+ long long rexp, xneg;
+
+
+ unsigned long long ux, ax, ur, mask;
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+ xneg = (ux != ax);
+
+ if (ax >= 0x4340000000000000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("floor", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x0000000000000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -1.0;
+ else
+ return 0.0;
+ }
+ else
+ {
+ r = x;
+ rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of r that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1;
+ ur = (ux & ~mask);
+ PUT_BITS_DP64(ur, r);
+ if (xneg && (ur != ux))
+ /* We threw some bits away and x was negative */
+ return r - 1.0;
+ else
+ return r;
+ }
+
+}
+
+weak_alias (__floor, floor)
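+
+/* Editorial illustration (not from the original sources): in the final branch
+ above, the unbiased exponent rexp says how many mantissa bits lie below the
+ binary point (EXPSHIFTBITS_DP64 - rexp); clearing them truncates toward zero,
+ and a negative non-integer then needs a further -1. A portable C sketch of the
+ masking step, with a hypothetical helper name:
+
+ #include <stdint.h>
+ #include <string.h>
+ static double floor_mask_sketch(double x) // assumes 1.0 <= |x| < 2^52
+ {
+ uint64_t ux, ur;
+ double r;
+ memcpy(&ux, &x, sizeof ux);
+ int rexp = (int)((ux >> 52) & 0x7ff) - 1023; // unbiased exponent
+ ur = ux & ~((1ULL << (52 - rexp)) - 1); // clear bits below the point
+ memcpy(&r, &ur, sizeof r);
+ return (x < 0.0 && ur != ux) ? r - 1.0 : r; // round down, not toward zero
+ }
+*/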
diff --git a/src/floorf.c b/src/floorf.c
new file mode 100644
index 0000000..e0f855b
--- /dev/null
+++ b/src/floorf.c
@@ -0,0 +1,87 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#endif
+
+#ifdef WINDOWS
+#pragma function(floorf)
+#endif
+
+float FN_PROTOTYPE(floorf)(float x)
+{
+ float r;
+ int rexp, xneg;
+ unsigned int ux, ax, ur, mask;
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & (~SIGNBIT_SP32);
+ xneg = (ux != ax);
+
+ if (ax >= 0x4b800000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^24 */
+ if (ax > 0x7f800000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf("floorf", ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3f800000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x00000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -1.0F;
+ else
+ return 0.0F;
+ }
+ else
+ {
+ rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ /* Mask out the bits of r that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1;
+ ur = (ux & ~mask);
+ PUT_BITS_SP32(ur, r);
+ if (xneg && (ux != ur))
+ /* We threw some bits away and x was negative */
+ return r - 1.0F;
+ else
+ return r;
+ }
+}
+
+weak_alias (__floorf, floorf)
diff --git a/src/frexp.c b/src/frexp.c
new file mode 100644
index 0000000..0ae109c
--- /dev/null
+++ b/src/frexp.c
@@ -0,0 +1,54 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+double FN_PROTOTYPE(frexp)(double value, int *exp)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = value;
+ sign = val.u32[1] & SIGNBIT_SP32;
+ val.u32[1] = val.u32[1] & ~SIGNBIT_SP32; /* remove the sign bit */
+ *exp = 0;
+ if((val.f64 == 0.0) || ((val.u32[1] & 0x7ff00000)== 0x7ff00000))
+ return value; /* value is +-0, NaN or +-Inf: return it unchanged */
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent - MULTIPLIER_DP;
+ }
+
+ exponent -= 1022; /* remove bias(1023)-1 */
+ *exp = exponent; /* set the integral power of two */
+ val.u32[1] = sign | 0x3fe00000 | (val.u32[1] & 0x000fffff);/* make the fractional part(divide by 2) */
+ return val.f64;
+}
+
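+/* Editorial illustration (not from the original sources): frexp() returns a
+ fraction m with 0.5 <= |m| < 1 and an exponent e such that value == m * 2^e,
+ which ldexp() can reassemble. A quick self-check of the routine above:
+
+ #include <math.h>
+ #include <stdio.h>
+ int main(void)
+ {
+ int e;
+ double m = frexp(12.0, &e); // expect m = 0.75, e = 4
+ printf("%g * 2^%d = %g\n", m, e, ldexp(m, e)); // prints 0.75 * 2^4 = 12
+ return 0;
+ }
+*/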
diff --git a/src/frexpf.c b/src/frexpf.c
new file mode 100644
index 0000000..e2b4ece
--- /dev/null
+++ b/src/frexpf.c
@@ -0,0 +1,55 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(frexpf)(float value, int *exp)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = value;
+ sign = val.u32 & SIGNBIT_SP32;
+ val.u32 = val.u32 & ~SIGNBIT_SP32; /* remove the sign bit */
+ *exp = 0;
+ if((val.f32 == 0.0) || ((val.u32 & 0x7f800000)== 0x7f800000))
+ return value; /* value is +-0, NaN or +-Inf: return it unchanged */
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent - MULTIPLIER_SP;
+ }
+
+ exponent -= 126; /* remove bias(127)-1 */
+ *exp = exponent; /* set the integral power of two */
+ val.u32 = sign | 0x3f000000 | (val.u32 & 0x007fffff);/* make the fractional part(divide by 2) */
+ return val.f32;
+}
+
diff --git a/src/gas/cbrt.S b/src/gas/cbrt.S
new file mode 100644
index 0000000..b733a1a
--- /dev/null
+++ b/src/gas/cbrt.S
@@ -0,0 +1,1575 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# cbrt.S
+#
+# An implementation of the cbrt libm function.
+#
+# Prototype:
+#
+# double cbrt(double x);
+#
+
+#
+# Algorithm:
+#
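+# (Outline reconstructed from the code below; editorial addition. The exact
+# table and polynomial constants live at the end of this file.)
+#
+# 1. Split |x| = 2^(3*q + r) * m with r in {-2,...,2}, and reduce the mantissa
+# against a 256-entry table value F (CBRT_F_H/L), forming f = m - F and
+# the small argument f/F via INV_TAB_256.
+# 2. Evaluate a degree-6 polynomial z approximating cbrt(1 + f/F) - 1.
+# 3. Recombine: cbrt(x) = sign(x) * 2^q * cbrt(2^r) * cbrt(F) * (1 + z),
+# using head/tail splits of cbrt(2^r) and cbrt(F) for extra precision.
+#
+# A rough C restatement of the same idea (illustration only; exp/log stand in
+# for the table + polynomial step):
+#
+# double cbrt_sketch(double x) {
+# int e, q, r;
+# double m = frexp(fabs(x), &e); /* |x| = m * 2^e, 0.5 <= m < 1 */
+# q = e / 3; r = e - 3 * q; /* r in {-2,...,2} */
+# double v = exp(log(m * ldexp(1.0, r)) / 3.0); /* cbrt of the reduced value */
+# return copysign(ldexp(v, q), x);
+# }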
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cbrt)
+#define fname_special _cbrt_special
+
+
+# local variable storage offsets
+
+.equ store_input, -0x10
+.equ stack_size, 0x20
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 32
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ xor %rdx,%rdx
+ #The stack pointer is deliberately left unchanged: this is a leaf procedure,
+ #so skipping the decrement/increment of %rsp saves a few instructions and
+ #helps performance. If a procedure call is ever added, the stack adjustment
+ #must be reinstated.
+ #sub $stack_size, %rsp
+ movd %xmm0,%rax
+ movsd %xmm0,%xmm6
+ mov .L__exp_mask_64(%rip),%r10
+ mov .L__mantissa_mask_64(%rip),%r11
+ mov %rax,%r9
+ and %r10,%rax # rax = stores the exponent
+ and %r11,%r9 # r9 = stores the mantissa
+ shr $52,%rax
+ cmp $0X7FF,%rax
+ jz .L__cbrt_is_Nan_Infinite
+ cmp $0X0,%rax
+ jz .L__cbrt_is_denormal
+
+.align 32
+.L__cbrt_is_normal:
+ mov $3,%rcx # cx is set to 3 to perform division and get the scale and remainder
+ pand .L__sign_bit_64(%rip),%xmm6 # xmm6 contains the sign
+ sub $0x3FF,%ax
+ #the cmp below is redundant since sub already sets the flags, but removing it gave no measurable gain
+ cmp $0,%ax
+ jge .L__donot_change_dx
+ not %dx
+.L__donot_change_dx:
+ idiv %cx #Accumulator dx:ax is divided by cx=3
+ #ax contains the quotient
+ #dx contains the remainder
+ mov %dx,%cx
+ add $0x3FF,%ax
+ shl $52,%rax
+ add $2,%cx
+ shl $1,%cx
+ #ax = Contains the quotient, Scale factor
+ mov %rax,store_input(%rsp)
+ movsd store_input(%rsp),%xmm7 #xmm7 is the scaling factor = mf
+ #xmm0 is the modified input value from the denormal case
+ pand .L__mantissa_mask_64(%rip),%xmm0
+ por .L__zero_point_five(%rip),%xmm0 #xmm0 = Y
+ mov %r9,%r10
+ shr $43,%r10
+ shr $44,%r9
+ and $0x01,%r10
+ or $0x0100,%r9
+ add %r9,%r10 #r10 = index_u64
+ cvtsi2sd %r10,%xmm4 #xmm4 = index_f64
+ sub $256,%r10
+ lea .L__INV_TAB_256(%rip),%rax
+ mulsd .L__one_by_512(%rip), %xmm4 #xmm4 = F
+ subsd %xmm4,%xmm0 # xmm0 = f
+ movsd (%rax,%r10,8),%xmm4
+ mulsd %xmm4,%xmm0 # xmm0 = r
+
+ #Now perform polynomial computation
+
+ # movddup %xmm0,%xmm0 # xmm0 = r ,r
+ shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+ mulsd %xmm0,%xmm0 # xmm0 = r ,r^2
+
+ movapd %xmm0,%xmm4 # xmm4 = r ,r^2
+ movapd %xmm0,%xmm3 # xmm3 = r ,r^2
+ mulpd %xmm0,%xmm0 # xmm0 = r^2,r^4 #########
+ mulpd %xmm0,%xmm3 # xmm3 = r^3,r^6 #########
+ movapd %xmm3,%xmm2
+ mulpd .L__coefficients_3_6(%rip),%xmm2 # xmm2 = [coeff3 * r^3, coeff6 * r^6]
+ mulpd %xmm0,%xmm3 # xmm3 = r^5,r^10 We don't need r^10
+ unpckhpd %xmm3,%xmm4 #xmm4 = r^5,r
+ mulpd .L__coefficients_2_4(%rip),%xmm0 # xmm0 = [coeff2 * r^2, coeff4 * r^4]
+ mulpd .L__coefficients_5_1(%rip),%xmm4 # xmm4 = [coeff5 * r^5, coeff1 * r ]
+ movapd %xmm4,%xmm3
+ unpckhpd %xmm3,%xmm3 #xmm3 = [~Don't Care ,coeff5 * r^5]
+ addsd %xmm3,%xmm2 # xmm2 = [coeff3 * r^3, coeff5 * r^5 + coeff6 * r^6]
+ addpd %xmm2,%xmm0 # xmm0 = [coeff2 * r^2 + coeff3 * r^3,coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+ movapd %xmm0,%xmm2
+ unpckhpd %xmm2,%xmm2 #xmm3 = [~Don't Care ,coeff2 * r^2 + coeff3 * r^3]
+ addsd %xmm2,%xmm0 # xmm0 = [~Don't Care, coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+ addsd %xmm4,%xmm0 # xmm0 = [~Don't Care, coeff1 * r + coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+
+ # movddup %xmm0,%xmm0
+ shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+
+ #Polynomial computation completes here
+ #Now compute the following
+ #switch(rem)
+ #{
+ # case -2: cbrtRem_h.u64 = 0x3fe428a2f0000000; cbrtRem_t.u64 = 0x3e531ae515c447bb; break;
+ # case -1: cbrtRem_h.u64 = 0x3fe965fea0000000; cbrtRem_t.u64 = 0x3e44f5b8f20ac166; break;
+ # case 0: cbrtRem_h.u64 = 0x3ff0000000000000; cbrtRem_t.u64 = 0x0000000000000000; break;
+ # case 1: cbrtRem_h.u64 = 0x3ff428a2f0000000; cbrtRem_t.u64 = 0x3e631ae515c447bb; break;
+ # case 2: cbrtRem_h.u64 = 0x3ff965fea0000000; cbrtRem_t.u64 = 0x3e54f5b8f20ac166; break;
+ # default: break;
+ #}
+ #cbrtF_h.u64 = CBRT_F_H[index_u64-256];
+ #cbrtF_t.u64 = CBRT_F_T[index_u64-256];
+ #
+ #bH = (cbrtF_h.f64 * cbrtRem_h.f64);
+ #bT = ((((cbrtF_t.f64 * cbrtRem_t.f64)) + (cbrtF_t.f64 * cbrtRem_h.f64)) + (cbrtRem_t.f64 * cbrtF_h.f64));
+ lea .L__cuberoot_remainder_h_l(%rip),%r8 # load both head and tail of the remainders cuberoot at once
+ movapd (%r8,%rcx,8),%xmm1 # xmm1 = [cbrtRem_h.f64,cbrtRem_t.f64]
+ shl $1,%r10
+ lea .L__CBRT_F_H_L_256(%rip),%rax
+ movapd (%rax,%r10,8),%xmm2 # xmm2 = [cbrtF_h.f64,cbrtF_t.f64]
+ movapd %xmm2,%xmm3
+ psrldq $8,%xmm3 # xmm3 = [~Dont Care,cbrtF_h.f64]
+ unpcklpd %xmm2,%xmm3 # xmm3 = [cbrtF_t.f64,cbrtF_h.f64]
+
+ mulpd %xmm1,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(cbrtRem_t.f64*cbrtF_t.f64)]
+ mulpd %xmm1,%xmm3 # xmm3 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_t.f64*cbrtF_h.f64)]
+ movapd %xmm3,%xmm4
+ unpckhpd %xmm4,%xmm4 # xmm4 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_h.f64*cbrtF_t.f64)]
+ addsd %xmm4,%xmm3 # xmm3 = [~Dont Care, ((cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+ addsd %xmm3,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(((cbrtRem_t.f64*cbrtF_t.f64)+(cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+ # xmm2 = [bH,bT]
+ # Now calculate
+ #ans.f64 = (((((z * bT)) + (bT)) + (z * bH)) + (bH));
+ #ans.f64 = ans.f64 * mf;
+ #ans.u64 = ans.u64 | sign.u64;
+
+ movapd %xmm2,%xmm3
+ unpckhpd %xmm3,%xmm3 # xmm3 = [Dont Care,bH]
+ # also xmm0 = [z,z] = the polynomial which was computed earlier
+ mulpd %xmm2,%xmm0 # xmm0 = [(bH*z),(bT*z)]
+ movapd %xmm0,%xmm4
+ unpckhpd %xmm4,%xmm4 # xmm4 = [(bH*z),(bH*z)]
+ addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((bT*z) + bT)]
+ unpckhpd %xmm2,%xmm2 # xmm2 = [(bH),(bH)]
+ addsd %xmm4,%xmm0 # xmm0 = [~DontCare, (((bT*z) + bT) + ( z*bH))]
+ addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((((bT*z) + bT) + (z*bH)) + bH)] = [~Dont Care,ans.f64]
+ mulsd %xmm7,%xmm0 # xmm0 = ans.f64 * mf; mf is the scaling factor
+ por %xmm6,%xmm0 # restore the sign
+ #add $stack_size, %rsp
+ ret
+
+
+.align 32
+.L__cbrt_is_denormal:
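+ #Denormal input: scale it into the normal range by constructing 1.0 with the
+ #denormal mantissa ORed in and subtracting 1.0, which yields mantissa*2^-52 as
+ #a normal number; then recompute the exponent with the extra 1022 adjustment
+ #below and re-enter the normal path.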
+ movsd .L__one_mask_64(%rip),%xmm4
+ cmp $0,%r9
+ jz .L__cbrt_is_zero
+ pand .L__sign_mask_64(%rip),%xmm0
+ por %xmm4,%xmm0
+ subsd %xmm4,%xmm0
+ movd %xmm0,%rax
+ mov %rax,%r9
+ and %r10,%rax # rax = stores the exponent
+ and %r11,%r9 # r9 = stores the mantissa
+ shr $52,%rax
+ sub $1022,%rax
+ jmp .L__cbrt_is_normal
+
+.align 32
+.L__cbrt_is_zero:
+ ret
+.align 32
+.L__cbrt_is_Nan_Infinite:
+ cmp $0,%r9
+ jz .L__cbrt_is_Infinite
+ mulsd %xmm0,%xmm0 #this multiplication will raise an invalid exception
+ por .L__qnan_mask_64(%rip),%xmm0
+.L__cbrt_is_Infinite:
+ #add $stack_size, %rsp
+ ret
+
+.align 32
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+ .quad 0 #this zero is necessary
+.L__qnan_mask_64: .quad 0x0008000000000000
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+.L__zero: .quad 0x0000000000000000
+ .quad 0
+.align 32
+.L__zero_point_five: .quad 0x3FE0000000000000
+ .quad 0
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__sign_bit_64: .quad 0x8000000000000000
+ .quad 0
+.L__one_mask_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__one_by_512: .quad 0x3f60000000000000
+ .quad 0
+
+
+.align 16
+.L__denormal_factor: .quad 0x3F7428A2F98D728B
+ .quad 0
+# The coefficients are arranged in a specific order to aid parallel multiplication.
+# The number beside each coefficient is the power of r by which it is multiplied.
+.L__coefficients:
+.align 32
+.L__coefficients_5_1: .quad 0x3fd5555555555555 # 1
+ .quad 0x3f9ee7113506ac13 # 5
+.L__coefficients_2_4: .quad 0xbfa511e8d2b3183b # 4
+ .quad 0xbfbc71c71c71c71c # 2
+.L__coefficients_3_6: .quad 0xbf98090d6221a247 # 6
+ .quad 0x3faf9add3c0ca458 # 3
+ .quad 0x3f93750ad588f114 # 7
+
+
+
+.align 32
+.L__cuberoot_remainder_h_l:
+ .quad 0x3e531ae515c447bb # cbrt(2^-2) Low
+ .quad 0x3FE428A2F0000000 # cbrt(2^-2) High
+ .quad 0x3e44f5b8f20ac166 # cbrt(2^-1) Low
+ .quad 0x3FE965FEA0000000 # cbrt(2^-1) High
+ .quad 0x0000000000000000 # cbrt(2^0) Low
+ .quad 0x3FF0000000000000 # cbrt(2^0) High
+ .quad 0x3e631ae515c447bb # cbrt(2^1) Low
+ .quad 0x3FF428A2F0000000 # cbrt(2^1) High
+ .quad 0x3e54f5b8f20ac166 # cbrt(2^2) Low
+ .quad 0x3FF965FEA0000000 # cbrt(2^2) High
+
+
+
+#interleaved high and low values
+.align 32
+.L__CBRT_F_H_L_256:
+ .quad 0x0000000000000000
+ .quad 0x3ff0000000000000
+ .quad 0x3e6e6a24c81e4294
+ .quad 0x3ff0055380000000
+ .quad 0x3e58548511e3a785
+ .quad 0x3ff00aa390000000
+ .quad 0x3e64eb9336ec07f6
+ .quad 0x3ff00ff010000000
+ .quad 0x3e40ea64b8b750e1
+ .quad 0x3ff0153920000000
+ .quad 0x3e461637cff8a53c
+ .quad 0x3ff01a7eb0000000
+ .quad 0x3e40733bf7bd1943
+ .quad 0x3ff01fc0d0000000
+ .quad 0x3e5666911345cced
+ .quad 0x3ff024ff80000000
+ .quad 0x3e477b7a3f592f14
+ .quad 0x3ff02a3ad0000000
+ .quad 0x3e6f18d3dd1a5402
+ .quad 0x3ff02f72b0000000
+ .quad 0x3e2be2f5a58ee9a4
+ .quad 0x3ff034a750000000
+ .quad 0x3e68901f8f085fa7
+ .quad 0x3ff039d880000000
+ .quad 0x3e5c68b8cd5b5d69
+ .quad 0x3ff03f0670000000
+ .quad 0x3e5a6b0e8624be42
+ .quad 0x3ff0443110000000
+ .quad 0x3dbc4b22b06f68e7
+ .quad 0x3ff0495870000000
+ .quad 0x3e60f3f0afcabe9b
+ .quad 0x3ff04e7c80000000
+ .quad 0x3e548495bca4e1b7
+ .quad 0x3ff0539d60000000
+ .quad 0x3e66107f1abdfdc3
+ .quad 0x3ff058bb00000000
+ .quad 0x3e6e67261878288a
+ .quad 0x3ff05dd570000000
+ .quad 0x3e5a6bc155286f1e
+ .quad 0x3ff062ecc0000000
+ .quad 0x3e58a759c64a85f2
+ .quad 0x3ff06800e0000000
+ .quad 0x3e45fce70a4a8d09
+ .quad 0x3ff06d11e0000000
+ .quad 0x3e32f9cbf373fe1d
+ .quad 0x3ff0721fc0000000
+ .quad 0x3e590564ce4ac359
+ .quad 0x3ff0772a80000000
+ .quad 0x3e5ac29ce761b02f
+ .quad 0x3ff07c3230000000
+ .quad 0x3e5cb752f497381c
+ .quad 0x3ff08136d0000000
+ .quad 0x3e68bb9e1cfb35e0
+ .quad 0x3ff0863860000000
+ .quad 0x3e65b4917099de90
+ .quad 0x3ff08b36f0000000
+ .quad 0x3e5cc77ac9c65ef2
+ .quad 0x3ff0903280000000
+ .quad 0x3e57a0f3e7be3dba
+ .quad 0x3ff0952b10000000
+ .quad 0x3e66ec851ee0c16f
+ .quad 0x3ff09a20a0000000
+ .quad 0x3e689449bf2946da
+ .quad 0x3ff09f1340000000
+ .quad 0x3e698f25301ba223
+ .quad 0x3ff0a402f0000000
+ .quad 0x3e347d5ec651f549
+ .quad 0x3ff0a8efc0000000
+ .quad 0x3e6c33ec9a86007a
+ .quad 0x3ff0add990000000
+ .quad 0x3e5e0b6653e92649
+ .quad 0x3ff0b2c090000000
+ .quad 0x3e3bd64ac09d755f
+ .quad 0x3ff0b7a4b0000000
+ .quad 0x3e2f537506f78167
+ .quad 0x3ff0bc85f0000000
+ .quad 0x3e62c382d1b3735e
+ .quad 0x3ff0c16450000000
+ .quad 0x3e6e20ed659f99e1
+ .quad 0x3ff0c63fe0000000
+ .quad 0x3e586b633a9c182a
+ .quad 0x3ff0cb18b0000000
+ .quad 0x3e445cfd5a65e777
+ .quad 0x3ff0cfeeb0000000
+ .quad 0x3e60c8770f58bca4
+ .quad 0x3ff0d4c1e0000000
+ .quad 0x3e6739e44b0933c5
+ .quad 0x3ff0d99250000000
+ .quad 0x3e027dc3d9ce7bd8
+ .quad 0x3ff0de6010000000
+ .quad 0x3e63c53c7c5a7b64
+ .quad 0x3ff0e32b00000000
+ .quad 0x3e69669683830cec
+ .quad 0x3ff0e7f340000000
+ .quad 0x3e68d772c39bdcc4
+ .quad 0x3ff0ecb8d0000000
+ .quad 0x3e69b0008bcf6d7b
+ .quad 0x3ff0f17bb0000000
+ .quad 0x3e3bbb305825ce4f
+ .quad 0x3ff0f63bf0000000
+ .quad 0x3e6da3f4af13a406
+ .quad 0x3ff0faf970000000
+ .quad 0x3e5f36b96f74ce86
+ .quad 0x3ff0ffb460000000
+ .quad 0x3e165c002303f790
+ .quad 0x3ff1046cb0000000
+ .quad 0x3e682f84095ba7d5
+ .quad 0x3ff1092250000000
+ .quad 0x3e6d46433541b2c6
+ .quad 0x3ff10dd560000000
+ .quad 0x3e671c3d56e93a89
+ .quad 0x3ff11285e0000000
+ .quad 0x3e598dcef4e40012
+ .quad 0x3ff11733d0000000
+ .quad 0x3e4530ebef17fe03
+ .quad 0x3ff11bdf30000000
+ .quad 0x3e4e8b8fa3715066
+ .quad 0x3ff1208800000000
+ .quad 0x3e6ab26eb3b211dc
+ .quad 0x3ff1252e40000000
+ .quad 0x3e454dd4dc906307
+ .quad 0x3ff129d210000000
+ .quad 0x3e5c9f962387984e
+ .quad 0x3ff12e7350000000
+ .quad 0x3e6c62a959afec09
+ .quad 0x3ff1331210000000
+ .quad 0x3e6638d9ac6a866a
+ .quad 0x3ff137ae60000000
+ .quad 0x3e338704eca8a22d
+ .quad 0x3ff13c4840000000
+ .quad 0x3e4e6c9e1db14f8f
+ .quad 0x3ff140dfa0000000
+ .quad 0x3e58744b7f9c9eaa
+ .quad 0x3ff1457490000000
+ .quad 0x3e66c2893486373b
+ .quad 0x3ff14a0710000000
+ .quad 0x3e5b36bce31699b7
+ .quad 0x3ff14e9730000000
+ .quad 0x3e671e3813d200c7
+ .quad 0x3ff15324e0000000
+ .quad 0x3e699755ab40aa88
+ .quad 0x3ff157b030000000
+ .quad 0x3e6b45ca0e4bcfc0
+ .quad 0x3ff15c3920000000
+ .quad 0x3e32dd090d869c5d
+ .quad 0x3ff160bfc0000000
+ .quad 0x3e64fe0516b917da
+ .quad 0x3ff16543f0000000
+ .quad 0x3e694563226317a2
+ .quad 0x3ff169c5d0000000
+ .quad 0x3e653d8fafc2c851
+ .quad 0x3ff16e4560000000
+ .quad 0x3e5dcbd41fbd41a3
+ .quad 0x3ff172c2a0000000
+ .quad 0x3e5862ff5285f59c
+ .quad 0x3ff1773d90000000
+ .quad 0x3e63072ea97a1e1c
+ .quad 0x3ff17bb630000000
+ .quad 0x3e52839075184805
+ .quad 0x3ff1802c90000000
+ .quad 0x3e64b0323e9eff42
+ .quad 0x3ff184a0a0000000
+ .quad 0x3e6b158893c45484
+ .quad 0x3ff1891270000000
+ .quad 0x3e3149ef0fc35826
+ .quad 0x3ff18d8210000000
+ .quad 0x3e5f2e77ea96acaa
+ .quad 0x3ff191ef60000000
+ .quad 0x3e5200074c471a95
+ .quad 0x3ff1965a80000000
+ .quad 0x3e63f8cc517f6f04
+ .quad 0x3ff19ac360000000
+ .quad 0x3e660ba2e311bb55
+ .quad 0x3ff19f2a10000000
+ .quad 0x3e64b788730bbec3
+ .quad 0x3ff1a38e90000000
+ .quad 0x3e657090795ee20c
+ .quad 0x3ff1a7f0e0000000
+ .quad 0x3e6d9ffe983670b1
+ .quad 0x3ff1ac5100000000
+ .quad 0x3e62a463ff61bfda
+ .quad 0x3ff1b0af00000000
+ .quad 0x3e69d1bc6a5e65cf
+ .quad 0x3ff1b50ad0000000
+ .quad 0x3e68718abaa9e922
+ .quad 0x3ff1b96480000000
+ .quad 0x3e63c2f52ffa342e
+ .quad 0x3ff1bdbc10000000
+ .quad 0x3e60fae13ff42c80
+ .quad 0x3ff1c21180000000
+ .quad 0x3e65440f0ef00d57
+ .quad 0x3ff1c664d0000000
+ .quad 0x3e46fcd22d4e3c1e
+ .quad 0x3ff1cab610000000
+ .quad 0x3e4e0c60b409e863
+ .quad 0x3ff1cf0530000000
+ .quad 0x3e6f9cab5a5f0333
+ .quad 0x3ff1d35230000000
+ .quad 0x3e630f24744c333d
+ .quad 0x3ff1d79d30000000
+ .quad 0x3e4b50622a76b2fe
+ .quad 0x3ff1dbe620000000
+ .quad 0x3e6fdb94ba595375
+ .quad 0x3ff1e02cf0000000
+ .quad 0x3e3861b9b945a171
+ .quad 0x3ff1e471d0000000
+ .quad 0x3e654348015188c4
+ .quad 0x3ff1e8b490000000
+ .quad 0x3e6b54d149865523
+ .quad 0x3ff1ecf550000000
+ .quad 0x3e6a0bb783d9de33
+ .quad 0x3ff1f13410000000
+ .quad 0x3e6629d12b1a2157
+ .quad 0x3ff1f570d0000000
+ .quad 0x3e6467fe35d179df
+ .quad 0x3ff1f9ab90000000
+ .quad 0x3e69763f3e26c8f7
+ .quad 0x3ff1fde450000000
+ .quad 0x3e53f798bb9f7679
+ .quad 0x3ff2021b20000000
+ .quad 0x3e552e577e855898
+ .quad 0x3ff2064ff0000000
+ .quad 0x3e6fde47e5502c3a
+ .quad 0x3ff20a82c0000000
+ .quad 0x3e5cbd0b548d96a0
+ .quad 0x3ff20eb3b0000000
+ .quad 0x3e6a9cd9f7be8de8
+ .quad 0x3ff212e2a0000000
+ .quad 0x3e522bbe704886de
+ .quad 0x3ff2170fb0000000
+ .quad 0x3e6e3dea8317f020
+ .quad 0x3ff21b3ac0000000
+ .quad 0x3e6e812085ac8855
+ .quad 0x3ff21f63f0000000
+ .quad 0x3e5c87144f24cb07
+ .quad 0x3ff2238b40000000
+ .quad 0x3e61e128ee311fa2
+ .quad 0x3ff227b0a0000000
+ .quad 0x3e5b5c163d61a2d3
+ .quad 0x3ff22bd420000000
+ .quad 0x3e47d97e7fb90633
+ .quad 0x3ff22ff5c0000000
+ .quad 0x3e6efe899d50f6a7
+ .quad 0x3ff2341570000000
+ .quad 0x3e6d0333eb75de5a
+ .quad 0x3ff2383350000000
+ .quad 0x3e40e590be73a573
+ .quad 0x3ff23c4f60000000
+ .quad 0x3e68ce8dcac3cdd2
+ .quad 0x3ff2406980000000
+ .quad 0x3e6ee8a48954064b
+ .quad 0x3ff24481d0000000
+ .quad 0x3e6aa62f18461e09
+ .quad 0x3ff2489850000000
+ .quad 0x3e601e5940986a15
+ .quad 0x3ff24cad00000000
+ .quad 0x3e3b082f4f9b8d4c
+ .quad 0x3ff250bfe0000000
+ .quad 0x3e6876e0e5527f5a
+ .quad 0x3ff254d0e0000000
+ .quad 0x3e63617080831e6b
+ .quad 0x3ff258e020000000
+ .quad 0x3e681b26e34aa4a2
+ .quad 0x3ff25ced90000000
+ .quad 0x3e552ee66dfab0c1
+ .quad 0x3ff260f940000000
+ .quad 0x3e5d85a5329e8819
+ .quad 0x3ff2650320000000
+ .quad 0x3e5105c1b646b5d1
+ .quad 0x3ff2690b40000000
+ .quad 0x3e6bb6690c1a379c
+ .quad 0x3ff26d1190000000
+ .quad 0x3e586aeba73ce3a9
+ .quad 0x3ff2711630000000
+ .quad 0x3e6dd16198294dd4
+ .quad 0x3ff2751900000000
+ .quad 0x3e6454e675775e83
+ .quad 0x3ff2791a20000000
+ .quad 0x3e63842e026197ea
+ .quad 0x3ff27d1980000000
+ .quad 0x3e6f1ce0e70c44d2
+ .quad 0x3ff2811720000000
+ .quad 0x3e6ad636441a5627
+ .quad 0x3ff2851310000000
+ .quad 0x3e54c205d7212abb
+ .quad 0x3ff2890d50000000
+ .quad 0x3e6167c86c116419
+ .quad 0x3ff28d05d0000000
+ .quad 0x3e638ec3ef16e294
+ .quad 0x3ff290fca0000000
+ .quad 0x3e6473fceace9321
+ .quad 0x3ff294f1c0000000
+ .quad 0x3e67af53a836dba7
+ .quad 0x3ff298e530000000
+ .quad 0x3e1a51f3c383b652
+ .quad 0x3ff29cd700000000
+ .quad 0x3e63696da190822d
+ .quad 0x3ff2a0c710000000
+ .quad 0x3e62f9adec77074b
+ .quad 0x3ff2a4b580000000
+ .quad 0x3e38190fd5bee55f
+ .quad 0x3ff2a8a250000000
+ .quad 0x3e4bfee8fac68e55
+ .quad 0x3ff2ac8d70000000
+ .quad 0x3e331c9d6bc5f68a
+ .quad 0x3ff2b076f0000000
+ .quad 0x3e689d0523737edf
+ .quad 0x3ff2b45ec0000000
+ .quad 0x3e5a295943bf47bb
+ .quad 0x3ff2b84500000000
+ .quad 0x3e396be32e5b3207
+ .quad 0x3ff2bc29a0000000
+ .quad 0x3e6e44c7d909fa0e
+ .quad 0x3ff2c00c90000000
+ .quad 0x3e2b2505da94d9ea
+ .quad 0x3ff2c3ee00000000
+ .quad 0x3e60c851f46c9c98
+ .quad 0x3ff2c7cdc0000000
+ .quad 0x3e5da71f7d9aa3b7
+ .quad 0x3ff2cbabf0000000
+ .quad 0x3e6f1b605d019ef1
+ .quad 0x3ff2cf8880000000
+ .quad 0x3e4386e8a2189563
+ .quad 0x3ff2d36390000000
+ .quad 0x3e3b19fa5d306ba7
+ .quad 0x3ff2d73d00000000
+ .quad 0x3e6dd749b67aef76
+ .quad 0x3ff2db14d0000000
+ .quad 0x3e676ff6f1dc04b0
+ .quad 0x3ff2deeb20000000
+ .quad 0x3e635a33d0b232a6
+ .quad 0x3ff2e2bfe0000000
+ .quad 0x3e64bdc80024a4e1
+ .quad 0x3ff2e69310000000
+ .quad 0x3e6ebd61770fd723
+ .quad 0x3ff2ea64b0000000
+ .quad 0x3e64769fc537264d
+ .quad 0x3ff2ee34d0000000
+ .quad 0x3e69021f429f3b98
+ .quad 0x3ff2f20360000000
+ .quad 0x3e5ee7083efbd606
+ .quad 0x3ff2f5d070000000
+ .quad 0x3e6ad985552a6b1a
+ .quad 0x3ff2f99bf0000000
+ .quad 0x3e6e3df778772160
+ .quad 0x3ff2fd65f0000000
+ .quad 0x3e6ca5d76ddc9b34
+ .quad 0x3ff3012e70000000
+ .quad 0x3e691154ffdbaf74
+ .quad 0x3ff304f570000000
+ .quad 0x3e667bdd57fb306a
+ .quad 0x3ff308baf0000000
+ .quad 0x3e67dc255ac40886
+ .quad 0x3ff30c7ef0000000
+ .quad 0x3df219f38e8afafe
+ .quad 0x3ff3104180000000
+ .quad 0x3e62416bf9669a04
+ .quad 0x3ff3140280000000
+ .quad 0x3e611c96b2b3987f
+ .quad 0x3ff317c210000000
+ .quad 0x3e6f99ed447e1177
+ .quad 0x3ff31b8020000000
+ .quad 0x3e13245826328a11
+ .quad 0x3ff31f3cd0000000
+ .quad 0x3e66f56dd1e645f8
+ .quad 0x3ff322f7f0000000
+ .quad 0x3e46164946945535
+ .quad 0x3ff326b1b0000000
+ .quad 0x3e5e37d59d190028
+ .quad 0x3ff32a69f0000000
+ .quad 0x3e668671f12bf828
+ .quad 0x3ff32e20c0000000
+ .quad 0x3e6e8ecbca6aabbd
+ .quad 0x3ff331d620000000
+ .quad 0x3e53f49e109a5912
+ .quad 0x3ff3358a20000000
+ .quad 0x3e6b8a0e11ec3043
+ .quad 0x3ff3393ca0000000
+ .quad 0x3e65fae00aed691a
+ .quad 0x3ff33cedc0000000
+ .quad 0x3e6c0569bece3e4a
+ .quad 0x3ff3409d70000000
+ .quad 0x3e605e26744efbfe
+ .quad 0x3ff3444bc0000000
+ .quad 0x3e65b570a94be5c5
+ .quad 0x3ff347f8a0000000
+ .quad 0x3e5d6f156ea0e063
+ .quad 0x3ff34ba420000000
+ .quad 0x3e6e0ca7612fc484
+ .quad 0x3ff34f4e30000000
+ .quad 0x3e4963c927b25258
+ .quad 0x3ff352f6f0000000
+ .quad 0x3e547930aa725a5c
+ .quad 0x3ff3569e40000000
+ .quad 0x3e58a79fe3af43b3
+ .quad 0x3ff35a4430000000
+ .quad 0x3e5e6dc29c41bdaf
+ .quad 0x3ff35de8c0000000
+ .quad 0x3e657a2e76f863a5
+ .quad 0x3ff3618bf0000000
+ .quad 0x3e2ae3b61716354d
+ .quad 0x3ff3652dd0000000
+ .quad 0x3e665fb5df6906b1
+ .quad 0x3ff368ce40000000
+ .quad 0x3e66177d7f588f7b
+ .quad 0x3ff36c6d60000000
+ .quad 0x3e3ad55abd091b67
+ .quad 0x3ff3700b30000000
+ .quad 0x3e155337b2422d76
+ .quad 0x3ff373a7a0000000
+ .quad 0x3e6084ebe86972d5
+ .quad 0x3ff37742b0000000
+ .quad 0x3e656395808e1ea3
+ .quad 0x3ff37adc70000000
+ .quad 0x3e61bce21b40fba7
+ .quad 0x3ff37e74e0000000
+ .quad 0x3e5006f94605b515
+ .quad 0x3ff3820c00000000
+ .quad 0x3e6aa676aceb1f7d
+ .quad 0x3ff385a1c0000000
+ .quad 0x3e58229f76554ce6
+ .quad 0x3ff3893640000000
+ .quad 0x3e6eabfc6cf57330
+ .quad 0x3ff38cc960000000
+ .quad 0x3e64daed9c0ce8bc
+ .quad 0x3ff3905b40000000
+ .quad 0x3e60ff1768237141
+ .quad 0x3ff393ebd0000000
+ .quad 0x3e6575f83051b085
+ .quad 0x3ff3977b10000000
+ .quad 0x3e42667deb523e29
+ .quad 0x3ff39b0910000000
+ .quad 0x3e1816996954f4fd
+ .quad 0x3ff39e95c0000000
+ .quad 0x3e587cfccf4d9cd4
+ .quad 0x3ff3a22120000000
+ .quad 0x3e52c5d018198353
+ .quad 0x3ff3a5ab40000000
+ .quad 0x3e6a7a898dcc34aa
+ .quad 0x3ff3a93410000000
+ .quad 0x3e2cead6dadc36d1
+ .quad 0x3ff3acbbb0000000
+ .quad 0x3e2a55759c498bdf
+ .quad 0x3ff3b04200000000
+ .quad 0x3e6c414a9ef6de04
+ .quad 0x3ff3b3c700000000
+ .quad 0x3e63e2108a6e58fa
+ .quad 0x3ff3b74ad0000000
+ .quad 0x3e5587fd7643d77c
+ .quad 0x3ff3bacd60000000
+ .quad 0x3e3901eb1d3ff3df
+ .quad 0x3ff3be4eb0000000
+ .quad 0x3e6f2ccd7c812fc6
+ .quad 0x3ff3c1ceb0000000
+ .quad 0x3e21c8ee70a01049
+ .quad 0x3ff3c54d90000000
+ .quad 0x3e563e8d02831eec
+ .quad 0x3ff3c8cb20000000
+ .quad 0x3e6f61a42a92c7ff
+ .quad 0x3ff3cc4770000000
+ .quad 0x3dda917399c84d24
+ .quad 0x3ff3cfc2a0000000
+ .quad 0x3e5e9197c8eec2f0
+ .quad 0x3ff3d33c80000000
+ .quad 0x3e5e6f842f5a1378
+ .quad 0x3ff3d6b530000000
+ .quad 0x3e2fac242a90a0fc
+ .quad 0x3ff3da2cb0000000
+ .quad 0x3e535ed726610227
+ .quad 0x3ff3dda2f0000000
+ .quad 0x3e50e0d64804b15b
+ .quad 0x3ff3e11800000000
+ .quad 0x3e0560675daba814
+ .quad 0x3ff3e48be0000000
+ .quad 0x3e637388c8768032
+ .quad 0x3ff3e7fe80000000
+ .quad 0x3e3ee3c89f9e01f5
+ .quad 0x3ff3eb7000000000
+ .quad 0x3e639f6f0d09747c
+ .quad 0x3ff3eee040000000
+ .quad 0x3e4322c327abb8f0
+ .quad 0x3ff3f24f60000000
+ .quad 0x3e6961b347c8ac80
+ .quad 0x3ff3f5bd40000000
+ .quad 0x3e63711fbbd0f118
+ .quad 0x3ff3f92a00000000
+ .quad 0x3e64fad8d7718ffb
+ .quad 0x3ff3fc9590000000
+ .quad 0x3e6fffffffffffff
+ .quad 0x3ff3fffff0000000
+ .quad 0x3e667efa79ec35b4
+ .quad 0x3ff4036930000000
+ .quad 0x3e6a737687a254a8
+ .quad 0x3ff406d140000000
+ .quad 0x3e5bace0f87d924d
+ .quad 0x3ff40a3830000000
+ .quad 0x3e629e37c237e392
+ .quad 0x3ff40d9df0000000
+ .quad 0x3e557ce7ac3f3012
+ .quad 0x3ff4110290000000
+ .quad 0x3e682829359f8fbd
+ .quad 0x3ff4146600000000
+ .quad 0x3e6cc9be42d14676
+ .quad 0x3ff417c850000000
+ .quad 0x3e6a8f001c137d0b
+ .quad 0x3ff41b2980000000
+ .quad 0x3e636127687dda05
+ .quad 0x3ff41e8990000000
+ .quad 0x3e524dba322646f0
+ .quad 0x3ff421e880000000
+ .quad 0x3e6dc43f1ed210b4
+ .quad 0x3ff4254640000000
+ .quad 0x3e631ae515c447bb
+ .quad 0x3ff428a2f0000000
+
+
+.align 32
+.L__CBRT_F_H_256: .quad 0x3ff0000000000000
+ .quad 0x3ff0055380000000
+ .quad 0x3ff00aa390000000
+ .quad 0x3ff00ff010000000
+ .quad 0x3ff0153920000000
+ .quad 0x3ff01a7eb0000000
+ .quad 0x3ff01fc0d0000000
+ .quad 0x3ff024ff80000000
+ .quad 0x3ff02a3ad0000000
+ .quad 0x3ff02f72b0000000
+ .quad 0x3ff034a750000000
+ .quad 0x3ff039d880000000
+ .quad 0x3ff03f0670000000
+ .quad 0x3ff0443110000000
+ .quad 0x3ff0495870000000
+ .quad 0x3ff04e7c80000000
+ .quad 0x3ff0539d60000000
+ .quad 0x3ff058bb00000000
+ .quad 0x3ff05dd570000000
+ .quad 0x3ff062ecc0000000
+ .quad 0x3ff06800e0000000
+ .quad 0x3ff06d11e0000000
+ .quad 0x3ff0721fc0000000
+ .quad 0x3ff0772a80000000
+ .quad 0x3ff07c3230000000
+ .quad 0x3ff08136d0000000
+ .quad 0x3ff0863860000000
+ .quad 0x3ff08b36f0000000
+ .quad 0x3ff0903280000000
+ .quad 0x3ff0952b10000000
+ .quad 0x3ff09a20a0000000
+ .quad 0x3ff09f1340000000
+ .quad 0x3ff0a402f0000000
+ .quad 0x3ff0a8efc0000000
+ .quad 0x3ff0add990000000
+ .quad 0x3ff0b2c090000000
+ .quad 0x3ff0b7a4b0000000
+ .quad 0x3ff0bc85f0000000
+ .quad 0x3ff0c16450000000
+ .quad 0x3ff0c63fe0000000
+ .quad 0x3ff0cb18b0000000
+ .quad 0x3ff0cfeeb0000000
+ .quad 0x3ff0d4c1e0000000
+ .quad 0x3ff0d99250000000
+ .quad 0x3ff0de6010000000
+ .quad 0x3ff0e32b00000000
+ .quad 0x3ff0e7f340000000
+ .quad 0x3ff0ecb8d0000000
+ .quad 0x3ff0f17bb0000000
+ .quad 0x3ff0f63bf0000000
+ .quad 0x3ff0faf970000000
+ .quad 0x3ff0ffb460000000
+ .quad 0x3ff1046cb0000000
+ .quad 0x3ff1092250000000
+ .quad 0x3ff10dd560000000
+ .quad 0x3ff11285e0000000
+ .quad 0x3ff11733d0000000
+ .quad 0x3ff11bdf30000000
+ .quad 0x3ff1208800000000
+ .quad 0x3ff1252e40000000
+ .quad 0x3ff129d210000000
+ .quad 0x3ff12e7350000000
+ .quad 0x3ff1331210000000
+ .quad 0x3ff137ae60000000
+ .quad 0x3ff13c4840000000
+ .quad 0x3ff140dfa0000000
+ .quad 0x3ff1457490000000
+ .quad 0x3ff14a0710000000
+ .quad 0x3ff14e9730000000
+ .quad 0x3ff15324e0000000
+ .quad 0x3ff157b030000000
+ .quad 0x3ff15c3920000000
+ .quad 0x3ff160bfc0000000
+ .quad 0x3ff16543f0000000
+ .quad 0x3ff169c5d0000000
+ .quad 0x3ff16e4560000000
+ .quad 0x3ff172c2a0000000
+ .quad 0x3ff1773d90000000
+ .quad 0x3ff17bb630000000
+ .quad 0x3ff1802c90000000
+ .quad 0x3ff184a0a0000000
+ .quad 0x3ff1891270000000
+ .quad 0x3ff18d8210000000
+ .quad 0x3ff191ef60000000
+ .quad 0x3ff1965a80000000
+ .quad 0x3ff19ac360000000
+ .quad 0x3ff19f2a10000000
+ .quad 0x3ff1a38e90000000
+ .quad 0x3ff1a7f0e0000000
+ .quad 0x3ff1ac5100000000
+ .quad 0x3ff1b0af00000000
+ .quad 0x3ff1b50ad0000000
+ .quad 0x3ff1b96480000000
+ .quad 0x3ff1bdbc10000000
+ .quad 0x3ff1c21180000000
+ .quad 0x3ff1c664d0000000
+ .quad 0x3ff1cab610000000
+ .quad 0x3ff1cf0530000000
+ .quad 0x3ff1d35230000000
+ .quad 0x3ff1d79d30000000
+ .quad 0x3ff1dbe620000000
+ .quad 0x3ff1e02cf0000000
+ .quad 0x3ff1e471d0000000
+ .quad 0x3ff1e8b490000000
+ .quad 0x3ff1ecf550000000
+ .quad 0x3ff1f13410000000
+ .quad 0x3ff1f570d0000000
+ .quad 0x3ff1f9ab90000000
+ .quad 0x3ff1fde450000000
+ .quad 0x3ff2021b20000000
+ .quad 0x3ff2064ff0000000
+ .quad 0x3ff20a82c0000000
+ .quad 0x3ff20eb3b0000000
+ .quad 0x3ff212e2a0000000
+ .quad 0x3ff2170fb0000000
+ .quad 0x3ff21b3ac0000000
+ .quad 0x3ff21f63f0000000
+ .quad 0x3ff2238b40000000
+ .quad 0x3ff227b0a0000000
+ .quad 0x3ff22bd420000000
+ .quad 0x3ff22ff5c0000000
+ .quad 0x3ff2341570000000
+ .quad 0x3ff2383350000000
+ .quad 0x3ff23c4f60000000
+ .quad 0x3ff2406980000000
+ .quad 0x3ff24481d0000000
+ .quad 0x3ff2489850000000
+ .quad 0x3ff24cad00000000
+ .quad 0x3ff250bfe0000000
+ .quad 0x3ff254d0e0000000
+ .quad 0x3ff258e020000000
+ .quad 0x3ff25ced90000000
+ .quad 0x3ff260f940000000
+ .quad 0x3ff2650320000000
+ .quad 0x3ff2690b40000000
+ .quad 0x3ff26d1190000000
+ .quad 0x3ff2711630000000
+ .quad 0x3ff2751900000000
+ .quad 0x3ff2791a20000000
+ .quad 0x3ff27d1980000000
+ .quad 0x3ff2811720000000
+ .quad 0x3ff2851310000000
+ .quad 0x3ff2890d50000000
+ .quad 0x3ff28d05d0000000
+ .quad 0x3ff290fca0000000
+ .quad 0x3ff294f1c0000000
+ .quad 0x3ff298e530000000
+ .quad 0x3ff29cd700000000
+ .quad 0x3ff2a0c710000000
+ .quad 0x3ff2a4b580000000
+ .quad 0x3ff2a8a250000000
+ .quad 0x3ff2ac8d70000000
+ .quad 0x3ff2b076f0000000
+ .quad 0x3ff2b45ec0000000
+ .quad 0x3ff2b84500000000
+ .quad 0x3ff2bc29a0000000
+ .quad 0x3ff2c00c90000000
+ .quad 0x3ff2c3ee00000000
+ .quad 0x3ff2c7cdc0000000
+ .quad 0x3ff2cbabf0000000
+ .quad 0x3ff2cf8880000000
+ .quad 0x3ff2d36390000000
+ .quad 0x3ff2d73d00000000
+ .quad 0x3ff2db14d0000000
+ .quad 0x3ff2deeb20000000
+ .quad 0x3ff2e2bfe0000000
+ .quad 0x3ff2e69310000000
+ .quad 0x3ff2ea64b0000000
+ .quad 0x3ff2ee34d0000000
+ .quad 0x3ff2f20360000000
+ .quad 0x3ff2f5d070000000
+ .quad 0x3ff2f99bf0000000
+ .quad 0x3ff2fd65f0000000
+ .quad 0x3ff3012e70000000
+ .quad 0x3ff304f570000000
+ .quad 0x3ff308baf0000000
+ .quad 0x3ff30c7ef0000000
+ .quad 0x3ff3104180000000
+ .quad 0x3ff3140280000000
+ .quad 0x3ff317c210000000
+ .quad 0x3ff31b8020000000
+ .quad 0x3ff31f3cd0000000
+ .quad 0x3ff322f7f0000000
+ .quad 0x3ff326b1b0000000
+ .quad 0x3ff32a69f0000000
+ .quad 0x3ff32e20c0000000
+ .quad 0x3ff331d620000000
+ .quad 0x3ff3358a20000000
+ .quad 0x3ff3393ca0000000
+ .quad 0x3ff33cedc0000000
+ .quad 0x3ff3409d70000000
+ .quad 0x3ff3444bc0000000
+ .quad 0x3ff347f8a0000000
+ .quad 0x3ff34ba420000000
+ .quad 0x3ff34f4e30000000
+ .quad 0x3ff352f6f0000000
+ .quad 0x3ff3569e40000000
+ .quad 0x3ff35a4430000000
+ .quad 0x3ff35de8c0000000
+ .quad 0x3ff3618bf0000000
+ .quad 0x3ff3652dd0000000
+ .quad 0x3ff368ce40000000
+ .quad 0x3ff36c6d60000000
+ .quad 0x3ff3700b30000000
+ .quad 0x3ff373a7a0000000
+ .quad 0x3ff37742b0000000
+ .quad 0x3ff37adc70000000
+ .quad 0x3ff37e74e0000000
+ .quad 0x3ff3820c00000000
+ .quad 0x3ff385a1c0000000
+ .quad 0x3ff3893640000000
+ .quad 0x3ff38cc960000000
+ .quad 0x3ff3905b40000000
+ .quad 0x3ff393ebd0000000
+ .quad 0x3ff3977b10000000
+ .quad 0x3ff39b0910000000
+ .quad 0x3ff39e95c0000000
+ .quad 0x3ff3a22120000000
+ .quad 0x3ff3a5ab40000000
+ .quad 0x3ff3a93410000000
+ .quad 0x3ff3acbbb0000000
+ .quad 0x3ff3b04200000000
+ .quad 0x3ff3b3c700000000
+ .quad 0x3ff3b74ad0000000
+ .quad 0x3ff3bacd60000000
+ .quad 0x3ff3be4eb0000000
+ .quad 0x3ff3c1ceb0000000
+ .quad 0x3ff3c54d90000000
+ .quad 0x3ff3c8cb20000000
+ .quad 0x3ff3cc4770000000
+ .quad 0x3ff3cfc2a0000000
+ .quad 0x3ff3d33c80000000
+ .quad 0x3ff3d6b530000000
+ .quad 0x3ff3da2cb0000000
+ .quad 0x3ff3dda2f0000000
+ .quad 0x3ff3e11800000000
+ .quad 0x3ff3e48be0000000
+ .quad 0x3ff3e7fe80000000
+ .quad 0x3ff3eb7000000000
+ .quad 0x3ff3eee040000000
+ .quad 0x3ff3f24f60000000
+ .quad 0x3ff3f5bd40000000
+ .quad 0x3ff3f92a00000000
+ .quad 0x3ff3fc9590000000
+ .quad 0x3ff3fffff0000000
+ .quad 0x3ff4036930000000
+ .quad 0x3ff406d140000000
+ .quad 0x3ff40a3830000000
+ .quad 0x3ff40d9df0000000
+ .quad 0x3ff4110290000000
+ .quad 0x3ff4146600000000
+ .quad 0x3ff417c850000000
+ .quad 0x3ff41b2980000000
+ .quad 0x3ff41e8990000000
+ .quad 0x3ff421e880000000
+ .quad 0x3ff4254640000000
+
+.align 32
+.L__CBRT_F_T_256: .quad 0x0000000000000000
+ .quad 0x3e6e6a24c81e4294
+ .quad 0x3e58548511e3a785
+ .quad 0x3e64eb9336ec07f6
+ .quad 0x3e40ea64b8b750e1
+ .quad 0x3e461637cff8a53c
+ .quad 0x3e40733bf7bd1943
+ .quad 0x3e5666911345cced
+ .quad 0x3e477b7a3f592f14
+ .quad 0x3e6f18d3dd1a5402
+ .quad 0x3e2be2f5a58ee9a4
+ .quad 0x3e68901f8f085fa7
+ .quad 0x3e5c68b8cd5b5d69
+ .quad 0x3e5a6b0e8624be42
+ .quad 0x3dbc4b22b06f68e7
+ .quad 0x3e60f3f0afcabe9b
+ .quad 0x3e548495bca4e1b7
+ .quad 0x3e66107f1abdfdc3
+ .quad 0x3e6e67261878288a
+ .quad 0x3e5a6bc155286f1e
+ .quad 0x3e58a759c64a85f2
+ .quad 0x3e45fce70a4a8d09
+ .quad 0x3e32f9cbf373fe1d
+ .quad 0x3e590564ce4ac359
+ .quad 0x3e5ac29ce761b02f
+ .quad 0x3e5cb752f497381c
+ .quad 0x3e68bb9e1cfb35e0
+ .quad 0x3e65b4917099de90
+ .quad 0x3e5cc77ac9c65ef2
+ .quad 0x3e57a0f3e7be3dba
+ .quad 0x3e66ec851ee0c16f
+ .quad 0x3e689449bf2946da
+ .quad 0x3e698f25301ba223
+ .quad 0x3e347d5ec651f549
+ .quad 0x3e6c33ec9a86007a
+ .quad 0x3e5e0b6653e92649
+ .quad 0x3e3bd64ac09d755f
+ .quad 0x3e2f537506f78167
+ .quad 0x3e62c382d1b3735e
+ .quad 0x3e6e20ed659f99e1
+ .quad 0x3e586b633a9c182a
+ .quad 0x3e445cfd5a65e777
+ .quad 0x3e60c8770f58bca4
+ .quad 0x3e6739e44b0933c5
+ .quad 0x3e027dc3d9ce7bd8
+ .quad 0x3e63c53c7c5a7b64
+ .quad 0x3e69669683830cec
+ .quad 0x3e68d772c39bdcc4
+ .quad 0x3e69b0008bcf6d7b
+ .quad 0x3e3bbb305825ce4f
+ .quad 0x3e6da3f4af13a406
+ .quad 0x3e5f36b96f74ce86
+ .quad 0x3e165c002303f790
+ .quad 0x3e682f84095ba7d5
+ .quad 0x3e6d46433541b2c6
+ .quad 0x3e671c3d56e93a89
+ .quad 0x3e598dcef4e40012
+ .quad 0x3e4530ebef17fe03
+ .quad 0x3e4e8b8fa3715066
+ .quad 0x3e6ab26eb3b211dc
+ .quad 0x3e454dd4dc906307
+ .quad 0x3e5c9f962387984e
+ .quad 0x3e6c62a959afec09
+ .quad 0x3e6638d9ac6a866a
+ .quad 0x3e338704eca8a22d
+ .quad 0x3e4e6c9e1db14f8f
+ .quad 0x3e58744b7f9c9eaa
+ .quad 0x3e66c2893486373b
+ .quad 0x3e5b36bce31699b7
+ .quad 0x3e671e3813d200c7
+ .quad 0x3e699755ab40aa88
+ .quad 0x3e6b45ca0e4bcfc0
+ .quad 0x3e32dd090d869c5d
+ .quad 0x3e64fe0516b917da
+ .quad 0x3e694563226317a2
+ .quad 0x3e653d8fafc2c851
+ .quad 0x3e5dcbd41fbd41a3
+ .quad 0x3e5862ff5285f59c
+ .quad 0x3e63072ea97a1e1c
+ .quad 0x3e52839075184805
+ .quad 0x3e64b0323e9eff42
+ .quad 0x3e6b158893c45484
+ .quad 0x3e3149ef0fc35826
+ .quad 0x3e5f2e77ea96acaa
+ .quad 0x3e5200074c471a95
+ .quad 0x3e63f8cc517f6f04
+ .quad 0x3e660ba2e311bb55
+ .quad 0x3e64b788730bbec3
+ .quad 0x3e657090795ee20c
+ .quad 0x3e6d9ffe983670b1
+ .quad 0x3e62a463ff61bfda
+ .quad 0x3e69d1bc6a5e65cf
+ .quad 0x3e68718abaa9e922
+ .quad 0x3e63c2f52ffa342e
+ .quad 0x3e60fae13ff42c80
+ .quad 0x3e65440f0ef00d57
+ .quad 0x3e46fcd22d4e3c1e
+ .quad 0x3e4e0c60b409e863
+ .quad 0x3e6f9cab5a5f0333
+ .quad 0x3e630f24744c333d
+ .quad 0x3e4b50622a76b2fe
+ .quad 0x3e6fdb94ba595375
+ .quad 0x3e3861b9b945a171
+ .quad 0x3e654348015188c4
+ .quad 0x3e6b54d149865523
+ .quad 0x3e6a0bb783d9de33
+ .quad 0x3e6629d12b1a2157
+ .quad 0x3e6467fe35d179df
+ .quad 0x3e69763f3e26c8f7
+ .quad 0x3e53f798bb9f7679
+ .quad 0x3e552e577e855898
+ .quad 0x3e6fde47e5502c3a
+ .quad 0x3e5cbd0b548d96a0
+ .quad 0x3e6a9cd9f7be8de8
+ .quad 0x3e522bbe704886de
+ .quad 0x3e6e3dea8317f020
+ .quad 0x3e6e812085ac8855
+ .quad 0x3e5c87144f24cb07
+ .quad 0x3e61e128ee311fa2
+ .quad 0x3e5b5c163d61a2d3
+ .quad 0x3e47d97e7fb90633
+ .quad 0x3e6efe899d50f6a7
+ .quad 0x3e6d0333eb75de5a
+ .quad 0x3e40e590be73a573
+ .quad 0x3e68ce8dcac3cdd2
+ .quad 0x3e6ee8a48954064b
+ .quad 0x3e6aa62f18461e09
+ .quad 0x3e601e5940986a15
+ .quad 0x3e3b082f4f9b8d4c
+ .quad 0x3e6876e0e5527f5a
+ .quad 0x3e63617080831e6b
+ .quad 0x3e681b26e34aa4a2
+ .quad 0x3e552ee66dfab0c1
+ .quad 0x3e5d85a5329e8819
+ .quad 0x3e5105c1b646b5d1
+ .quad 0x3e6bb6690c1a379c
+ .quad 0x3e586aeba73ce3a9
+ .quad 0x3e6dd16198294dd4
+ .quad 0x3e6454e675775e83
+ .quad 0x3e63842e026197ea
+ .quad 0x3e6f1ce0e70c44d2
+ .quad 0x3e6ad636441a5627
+ .quad 0x3e54c205d7212abb
+ .quad 0x3e6167c86c116419
+ .quad 0x3e638ec3ef16e294
+ .quad 0x3e6473fceace9321
+ .quad 0x3e67af53a836dba7
+ .quad 0x3e1a51f3c383b652
+ .quad 0x3e63696da190822d
+ .quad 0x3e62f9adec77074b
+ .quad 0x3e38190fd5bee55f
+ .quad 0x3e4bfee8fac68e55
+ .quad 0x3e331c9d6bc5f68a
+ .quad 0x3e689d0523737edf
+ .quad 0x3e5a295943bf47bb
+ .quad 0x3e396be32e5b3207
+ .quad 0x3e6e44c7d909fa0e
+ .quad 0x3e2b2505da94d9ea
+ .quad 0x3e60c851f46c9c98
+ .quad 0x3e5da71f7d9aa3b7
+ .quad 0x3e6f1b605d019ef1
+ .quad 0x3e4386e8a2189563
+ .quad 0x3e3b19fa5d306ba7
+ .quad 0x3e6dd749b67aef76
+ .quad 0x3e676ff6f1dc04b0
+ .quad 0x3e635a33d0b232a6
+ .quad 0x3e64bdc80024a4e1
+ .quad 0x3e6ebd61770fd723
+ .quad 0x3e64769fc537264d
+ .quad 0x3e69021f429f3b98
+ .quad 0x3e5ee7083efbd606
+ .quad 0x3e6ad985552a6b1a
+ .quad 0x3e6e3df778772160
+ .quad 0x3e6ca5d76ddc9b34
+ .quad 0x3e691154ffdbaf74
+ .quad 0x3e667bdd57fb306a
+ .quad 0x3e67dc255ac40886
+ .quad 0x3df219f38e8afafe
+ .quad 0x3e62416bf9669a04
+ .quad 0x3e611c96b2b3987f
+ .quad 0x3e6f99ed447e1177
+ .quad 0x3e13245826328a11
+ .quad 0x3e66f56dd1e645f8
+ .quad 0x3e46164946945535
+ .quad 0x3e5e37d59d190028
+ .quad 0x3e668671f12bf828
+ .quad 0x3e6e8ecbca6aabbd
+ .quad 0x3e53f49e109a5912
+ .quad 0x3e6b8a0e11ec3043
+ .quad 0x3e65fae00aed691a
+ .quad 0x3e6c0569bece3e4a
+ .quad 0x3e605e26744efbfe
+ .quad 0x3e65b570a94be5c5
+ .quad 0x3e5d6f156ea0e063
+ .quad 0x3e6e0ca7612fc484
+ .quad 0x3e4963c927b25258
+ .quad 0x3e547930aa725a5c
+ .quad 0x3e58a79fe3af43b3
+ .quad 0x3e5e6dc29c41bdaf
+ .quad 0x3e657a2e76f863a5
+ .quad 0x3e2ae3b61716354d
+ .quad 0x3e665fb5df6906b1
+ .quad 0x3e66177d7f588f7b
+ .quad 0x3e3ad55abd091b67
+ .quad 0x3e155337b2422d76
+ .quad 0x3e6084ebe86972d5
+ .quad 0x3e656395808e1ea3
+ .quad 0x3e61bce21b40fba7
+ .quad 0x3e5006f94605b515
+ .quad 0x3e6aa676aceb1f7d
+ .quad 0x3e58229f76554ce6
+ .quad 0x3e6eabfc6cf57330
+ .quad 0x3e64daed9c0ce8bc
+ .quad 0x3e60ff1768237141
+ .quad 0x3e6575f83051b085
+ .quad 0x3e42667deb523e29
+ .quad 0x3e1816996954f4fd
+ .quad 0x3e587cfccf4d9cd4
+ .quad 0x3e52c5d018198353
+ .quad 0x3e6a7a898dcc34aa
+ .quad 0x3e2cead6dadc36d1
+ .quad 0x3e2a55759c498bdf
+ .quad 0x3e6c414a9ef6de04
+ .quad 0x3e63e2108a6e58fa
+ .quad 0x3e5587fd7643d77c
+ .quad 0x3e3901eb1d3ff3df
+ .quad 0x3e6f2ccd7c812fc6
+ .quad 0x3e21c8ee70a01049
+ .quad 0x3e563e8d02831eec
+ .quad 0x3e6f61a42a92c7ff
+ .quad 0x3dda917399c84d24
+ .quad 0x3e5e9197c8eec2f0
+ .quad 0x3e5e6f842f5a1378
+ .quad 0x3e2fac242a90a0fc
+ .quad 0x3e535ed726610227
+ .quad 0x3e50e0d64804b15b
+ .quad 0x3e0560675daba814
+ .quad 0x3e637388c8768032
+ .quad 0x3e3ee3c89f9e01f5
+ .quad 0x3e639f6f0d09747c
+ .quad 0x3e4322c327abb8f0
+ .quad 0x3e6961b347c8ac80
+ .quad 0x3e63711fbbd0f118
+ .quad 0x3e64fad8d7718ffb
+ .quad 0x3e6fffffffffffff
+ .quad 0x3e667efa79ec35b4
+ .quad 0x3e6a737687a254a8
+ .quad 0x3e5bace0f87d924d
+ .quad 0x3e629e37c237e392
+ .quad 0x3e557ce7ac3f3012
+ .quad 0x3e682829359f8fbd
+ .quad 0x3e6cc9be42d14676
+ .quad 0x3e6a8f001c137d0b
+ .quad 0x3e636127687dda05
+ .quad 0x3e524dba322646f0
+ .quad 0x3e6dc43f1ed210b4
+
+.align 32
+.L__INV_TAB_256: .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+
diff --git a/src/gas/cbrtf.S b/src/gas/cbrtf.S
new file mode 100644
index 0000000..21bdd0b
--- /dev/null
+++ b/src/gas/cbrtf.S
@@ -0,0 +1,717 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# cbrtf.S
+#
+# An implementation of the cbrtf libm function.
+#
+# Prototype:
+#
+# float cbrtf(float x);
+#
+
+#
+# Algorithm:
+#
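+# (Sketch, reconstructed from the code below; kept here as a reference, not
+#  as normative documentation.)
+#
+#   x = (-1)^s * 2^e * 1.m              /* denormals are first scaled by 2^23 */
+#   e = 3*q + r, with r in {-2,...,2}
+#   cbrtf(x) = (-1)^s * 2^q * cbrt(2^r) * cbrt(1.m)
+#   1.m = F * (1 + u), where 1/F is read from .L__DoubleReciprocalTable_256
+#   (indexed by the top 8 mantissa bits) and cbrt(F) from .L__CubeRootTable_256
+#   cbrt(1 + u) ~= 1 + u/3 - u*u/9      /* coefficients in .L__coefficients */
+#   cbrt(2^r) is read from .L__defined_cuberoot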
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cbrtf)
+#define fname_special _cbrtf_special
+
+
+# local variable storage offsets
+
+.equ store_input, 0x0
+.equ stack_size, 0x20
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 32
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ xor %rcx,%rcx
+ sub $stack_size, %rsp
+ movss %xmm0, store_input(%rsp)
+ movss %xmm0,%xmm1
+ mov store_input(%rsp),%r8
+ mov $0x7F800000,%r10
+ mov $0x007FFFFF,%r11
+ mov %r8,%r9
+	and %r10,%r8	 # r8 holds the exponent bits
+	and %r11,%r9	 # r9 holds the mantissa bits
+ cmp $0X7F800000,%r8
+ jz .L__cbrtf_is_nan_infinite
+ cmp $0X0,%r8
+ jz .L__cbrtf_is_denormal
+.align 32
+.L__cbrtf_is_normal:
+ cvtps2pd %xmm1,%xmm1
+ shr $23,%r8 # exp value
+	mov $3,%rdx	# check whether dx always needs to be set to 3
+ mov %r8,%rax
+ movsd %xmm1,%xmm6
+ shr $15,%r9 # index for the reciprocal
+ sub $0x7F,%ax
+ idiv %dl # Accumulator is divided by dl=3
+ mov %ax,%dx
+	shr $8,%dx	# dx contains the remainder
+ add $2,%dl
+	# ax contains the quotient (the scale factor)
+ cbw # sign extend al to ax
+ add $0x3FF,%ax
+ shl $52,%rax
+ pand .L__mantissa_mask_64(%rip),%xmm1
+ mov %rax,store_input(%rsp)
+ movsd store_input(%rsp),%xmm7
+ movsd .L__sign_mask_64(%rip),%xmm2
+ por .L__one_mask_64(%rip),%xmm1
+ movapd .L__coefficients(%rip),%xmm0
+ pandn %xmm1,%xmm2
+ pand .L__sign_mask_64(%rip),%xmm6 # has the sign
+ lea .L__DoubleReciprocalTable_256(%rip),%r8
+ lea .L__CubeRootTable_256(%rip),%rax
+	movsd (%r8,%r9,8),%xmm3		# reciprocal; table entries are 8-byte doubles
+	movsd (%rax,%r9,8),%xmm4	# cube root
+ mulsd %xmm2,%xmm3
+ subsd .L__one_mask_64(%rip),%xmm3
+
+ # movddup %xmm3,%xmm3
+ shufpd $0,%xmm3,%xmm3 # replacing movddup
+
+ mulsd %xmm3,%xmm3
+ mulpd %xmm3,%xmm0
+#######################################################################
+# haddpd is an SSE3 instruction; using it here gives better performance.
+	#haddpd %xmm0,%xmm0
+# The three instructions below must be commented out and the haddpd above
+# uncommented if SSE3 instructions can be used.
+ movapd %xmm0,%xmm3
+ unpckhpd %xmm3,%xmm3
+ addsd %xmm3,%xmm0
+#######################################################################
+ addsd .L__one_mask_64(%rip),%xmm0
+ mulsd %xmm7,%xmm0
+ lea .L__defined_cuberoot(%rip),%rax
+ mulsd (%rax,%rdx,8),%xmm0
+
+ mulsd %xmm4,%xmm0
+ cmp $1,%cx
+ jnz .L__final_result
+ mulsd .L__denormal_factor(%rip),%xmm0
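+					# (.L__denormal_factor is cbrt(2^-23), which undoes the
+					#  2^23 scaling applied in .L__cbrtf_is_denormal)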
+
+.align 32
+.L__final_result:
+ por %xmm6, %xmm0
+ cvtsd2ss %xmm0,%xmm0
+ add $stack_size, %rsp
+ ret
+
+
+.align 32
+.L__cbrtf_is_denormal:
+ cmp $0,%r9
+ jz .L__cbrtf_is_zero
+ mulss .L__2_pow_23(%rip),%xmm1
+ movss %xmm1, store_input(%rsp)
+ mov $1,%cx
+ mov store_input(%rsp),%r8
+ mov %r8,%r9
+	and %r10,%r8	 # r8 holds the exponent bits
+	and %r11,%r9	 # r9 holds the mantissa bits
+ jmp .L__cbrtf_is_normal
+
+.align 32
+.L__cbrtf_is_nan_infinite:
+ cmp $0,%r9
+ jz .L__cbrtf_is_infinite
+ mulss %xmm0,%xmm0 #this multiplication will raise an invalid exception
+ por .L__qnan_mask_32(%rip),%xmm0
+
+.L__cbrtf_is_infinite:
+.L__cbrtf_is_one:
+.L__cbrtf_is_zero:
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.L__mantissa_mask_32: .long 0x007FFFFF
+ .long 0 #this zero is necessary
+.align 16
+.L__qnan_mask_32: .long 0x00400000
+ .long 0
+.L__exp_mask_32: .long 0x7F800000
+ .long 0
+.L__zero: .long 0x00000000
+ .long 0
+.align 16
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+.L__2_pow_23: .long 0x4B000000
+
+
+.align 16
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0
+.L__one_mask_64: .quad 0x3FF0000000000000
+ .quad 0
+
+.align 16
+.L__denormal_factor: .quad 0x3F7428A2F98D728B
+ .quad 0
+.align 16
+.L__coefficients:
+ .quad 0xbFBC71C71C71C71C
+ .quad 0x3fd5555555555555
+.align 16
+.L__defined_cuberoot: .quad 0x3FE428A2F98D728B
+ .quad 0x3FE965FEA53D6E3D
+ .quad 0x3FF0000000000000
+ .quad 0x3FF428A2F98D728B
+ .quad 0x3FF965FEA53D6E3D
+
+.align 32
+.L__DoubleReciprocalTable_256: .quad 0X3ff0000000000000
+ .quad 0X3fefe00000000000
+ .quad 0X3fefc00000000000
+ .quad 0X3fefa00000000000
+ .quad 0X3fef800000000000
+ .quad 0X3fef600000000000
+ .quad 0X3fef400000000000
+ .quad 0X3fef200000000000
+ .quad 0X3fef000000000000
+ .quad 0X3feee00000000000
+ .quad 0X3feec00000000000
+ .quad 0X3feea00000000000
+ .quad 0X3fee900000000000
+ .quad 0X3fee700000000000
+ .quad 0X3fee500000000000
+ .quad 0X3fee300000000000
+ .quad 0X3fee100000000000
+ .quad 0X3fee000000000000
+ .quad 0X3fede00000000000
+ .quad 0X3fedc00000000000
+ .quad 0X3feda00000000000
+ .quad 0X3fed900000000000
+ .quad 0X3fed700000000000
+ .quad 0X3fed500000000000
+ .quad 0X3fed400000000000
+ .quad 0X3fed200000000000
+ .quad 0X3fed000000000000
+ .quad 0X3fecf00000000000
+ .quad 0X3fecd00000000000
+ .quad 0X3fecb00000000000
+ .quad 0X3feca00000000000
+ .quad 0X3fec800000000000
+ .quad 0X3fec700000000000
+ .quad 0X3fec500000000000
+ .quad 0X3fec300000000000
+ .quad 0X3fec200000000000
+ .quad 0X3fec000000000000
+ .quad 0X3febf00000000000
+ .quad 0X3febd00000000000
+ .quad 0X3febc00000000000
+ .quad 0X3feba00000000000
+ .quad 0X3feb900000000000
+ .quad 0X3feb700000000000
+ .quad 0X3feb600000000000
+ .quad 0X3feb400000000000
+ .quad 0X3feb300000000000
+ .quad 0X3feb200000000000
+ .quad 0X3feb000000000000
+ .quad 0X3feaf00000000000
+ .quad 0X3fead00000000000
+ .quad 0X3feac00000000000
+ .quad 0X3feaa00000000000
+ .quad 0X3fea900000000000
+ .quad 0X3fea800000000000
+ .quad 0X3fea600000000000
+ .quad 0X3fea500000000000
+ .quad 0X3fea400000000000
+ .quad 0X3fea200000000000
+ .quad 0X3fea100000000000
+ .quad 0X3fea000000000000
+ .quad 0X3fe9e00000000000
+ .quad 0X3fe9d00000000000
+ .quad 0X3fe9c00000000000
+ .quad 0X3fe9a00000000000
+ .quad 0X3fe9900000000000
+ .quad 0X3fe9800000000000
+ .quad 0X3fe9700000000000
+ .quad 0X3fe9500000000000
+ .quad 0X3fe9400000000000
+ .quad 0X3fe9300000000000
+ .quad 0X3fe9200000000000
+ .quad 0X3fe9000000000000
+ .quad 0X3fe8f00000000000
+ .quad 0X3fe8e00000000000
+ .quad 0X3fe8d00000000000
+ .quad 0X3fe8b00000000000
+ .quad 0X3fe8a00000000000
+ .quad 0X3fe8900000000000
+ .quad 0X3fe8800000000000
+ .quad 0X3fe8700000000000
+ .quad 0X3fe8600000000000
+ .quad 0X3fe8400000000000
+ .quad 0X3fe8300000000000
+ .quad 0X3fe8200000000000
+ .quad 0X3fe8100000000000
+ .quad 0X3fe8000000000000
+ .quad 0X3fe7f00000000000
+ .quad 0X3fe7e00000000000
+ .quad 0X3fe7d00000000000
+ .quad 0X3fe7b00000000000
+ .quad 0X3fe7a00000000000
+ .quad 0X3fe7900000000000
+ .quad 0X3fe7800000000000
+ .quad 0X3fe7700000000000
+ .quad 0X3fe7600000000000
+ .quad 0X3fe7500000000000
+ .quad 0X3fe7400000000000
+ .quad 0X3fe7300000000000
+ .quad 0X3fe7200000000000
+ .quad 0X3fe7100000000000
+ .quad 0X3fe7000000000000
+ .quad 0X3fe6f00000000000
+ .quad 0X3fe6e00000000000
+ .quad 0X3fe6d00000000000
+ .quad 0X3fe6c00000000000
+ .quad 0X3fe6b00000000000
+ .quad 0X3fe6a00000000000
+ .quad 0X3fe6900000000000
+ .quad 0X3fe6800000000000
+ .quad 0X3fe6700000000000
+ .quad 0X3fe6600000000000
+ .quad 0X3fe6500000000000
+ .quad 0X3fe6400000000000
+ .quad 0X3fe6300000000000
+ .quad 0X3fe6200000000000
+ .quad 0X3fe6100000000000
+ .quad 0X3fe6000000000000
+ .quad 0X3fe5f00000000000
+ .quad 0X3fe5e00000000000
+ .quad 0X3fe5d00000000000
+ .quad 0X3fe5c00000000000
+ .quad 0X3fe5b00000000000
+ .quad 0X3fe5a00000000000
+ .quad 0X3fe5900000000000
+ .quad 0X3fe5800000000000
+ .quad 0X3fe5800000000000
+ .quad 0X3fe5700000000000
+ .quad 0X3fe5600000000000
+ .quad 0X3fe5500000000000
+ .quad 0X3fe5400000000000
+ .quad 0X3fe5300000000000
+ .quad 0X3fe5200000000000
+ .quad 0X3fe5100000000000
+ .quad 0X3fe5000000000000
+ .quad 0X3fe5000000000000
+ .quad 0X3fe4f00000000000
+ .quad 0X3fe4e00000000000
+ .quad 0X3fe4d00000000000
+ .quad 0X3fe4c00000000000
+ .quad 0X3fe4b00000000000
+ .quad 0X3fe4a00000000000
+ .quad 0X3fe4a00000000000
+ .quad 0X3fe4900000000000
+ .quad 0X3fe4800000000000
+ .quad 0X3fe4700000000000
+ .quad 0X3fe4600000000000
+ .quad 0X3fe4600000000000
+ .quad 0X3fe4500000000000
+ .quad 0X3fe4400000000000
+ .quad 0X3fe4300000000000
+ .quad 0X3fe4200000000000
+ .quad 0X3fe4200000000000
+ .quad 0X3fe4100000000000
+ .quad 0X3fe4000000000000
+ .quad 0X3fe3f00000000000
+ .quad 0X3fe3e00000000000
+ .quad 0X3fe3e00000000000
+ .quad 0X3fe3d00000000000
+ .quad 0X3fe3c00000000000
+ .quad 0X3fe3b00000000000
+ .quad 0X3fe3b00000000000
+ .quad 0X3fe3a00000000000
+ .quad 0X3fe3900000000000
+ .quad 0X3fe3800000000000
+ .quad 0X3fe3800000000000
+ .quad 0X3fe3700000000000
+ .quad 0X3fe3600000000000
+ .quad 0X3fe3500000000000
+ .quad 0X3fe3500000000000
+ .quad 0X3fe3400000000000
+ .quad 0X3fe3300000000000
+ .quad 0X3fe3200000000000
+ .quad 0X3fe3200000000000
+ .quad 0X3fe3100000000000
+ .quad 0X3fe3000000000000
+ .quad 0X3fe3000000000000
+ .quad 0X3fe2f00000000000
+ .quad 0X3fe2e00000000000
+ .quad 0X3fe2e00000000000
+ .quad 0X3fe2d00000000000
+ .quad 0X3fe2c00000000000
+ .quad 0X3fe2b00000000000
+ .quad 0X3fe2b00000000000
+ .quad 0X3fe2a00000000000
+ .quad 0X3fe2900000000000
+ .quad 0X3fe2900000000000
+ .quad 0X3fe2800000000000
+ .quad 0X3fe2700000000000
+ .quad 0X3fe2700000000000
+ .quad 0X3fe2600000000000
+ .quad 0X3fe2500000000000
+ .quad 0X3fe2500000000000
+ .quad 0X3fe2400000000000
+ .quad 0X3fe2300000000000
+ .quad 0X3fe2300000000000
+ .quad 0X3fe2200000000000
+ .quad 0X3fe2100000000000
+ .quad 0X3fe2100000000000
+ .quad 0X3fe2000000000000
+ .quad 0X3fe2000000000000
+ .quad 0X3fe1f00000000000
+ .quad 0X3fe1e00000000000
+ .quad 0X3fe1e00000000000
+ .quad 0X3fe1d00000000000
+ .quad 0X3fe1c00000000000
+ .quad 0X3fe1c00000000000
+ .quad 0X3fe1b00000000000
+ .quad 0X3fe1b00000000000
+ .quad 0X3fe1a00000000000
+ .quad 0X3fe1900000000000
+ .quad 0X3fe1900000000000
+ .quad 0X3fe1800000000000
+ .quad 0X3fe1800000000000
+ .quad 0X3fe1700000000000
+ .quad 0X3fe1600000000000
+ .quad 0X3fe1600000000000
+ .quad 0X3fe1500000000000
+ .quad 0X3fe1500000000000
+ .quad 0X3fe1400000000000
+ .quad 0X3fe1300000000000
+ .quad 0X3fe1300000000000
+ .quad 0X3fe1200000000000
+ .quad 0X3fe1200000000000
+ .quad 0X3fe1100000000000
+ .quad 0X3fe1100000000000
+ .quad 0X3fe1000000000000
+ .quad 0X3fe0f00000000000
+ .quad 0X3fe0f00000000000
+ .quad 0X3fe0e00000000000
+ .quad 0X3fe0e00000000000
+ .quad 0X3fe0d00000000000
+ .quad 0X3fe0d00000000000
+ .quad 0X3fe0c00000000000
+ .quad 0X3fe0c00000000000
+ .quad 0X3fe0b00000000000
+ .quad 0X3fe0a00000000000
+ .quad 0X3fe0a00000000000
+ .quad 0X3fe0900000000000
+ .quad 0X3fe0900000000000
+ .quad 0X3fe0800000000000
+ .quad 0X3fe0800000000000
+ .quad 0X3fe0700000000000
+ .quad 0X3fe0700000000000
+ .quad 0X3fe0600000000000
+ .quad 0X3fe0600000000000
+ .quad 0X3fe0500000000000
+ .quad 0X3fe0500000000000
+ .quad 0X3fe0400000000000
+ .quad 0X3fe0400000000000
+ .quad 0X3fe0300000000000
+ .quad 0X3fe0300000000000
+ .quad 0X3fe0200000000000
+ .quad 0X3fe0200000000000
+ .quad 0X3fe0100000000000
+ .quad 0X3fe0100000000000
+ .quad 0X3fe0000000000000
+
+.align 32
+.L__CubeRootTable_256: .quad 0X3ff0000000000000
+ .quad 0X3ff00558e6547c36
+ .quad 0X3ff00ab8f9d2f374
+ .quad 0X3ff010204b673fc7
+ .quad 0X3ff0158eec36749b
+ .quad 0X3ff01b04ed9fdb53
+ .quad 0X3ff02082613df53c
+ .quad 0X3ff0260758e78308
+ .quad 0X3ff02b93e6b091f0
+ .quad 0X3ff031281ceb8ea2
+ .quad 0X3ff036c40e2a5e2a
+ .quad 0X3ff03c67cd3f7cea
+ .quad 0X3ff03f3c9fee224c
+ .quad 0X3ff044ec379f7f79
+ .quad 0X3ff04aa3cd578d67
+ .quad 0X3ff0506374d40a3d
+ .quad 0X3ff0562b4218a6e3
+ .quad 0X3ff059123d3a9848
+ .quad 0X3ff05ee6694e7166
+ .quad 0X3ff064c2ee6e07c6
+ .quad 0X3ff06aa7e19c01c5
+ .quad 0X3ff06d9d8b1decca
+ .quad 0X3ff0738f4b6cc8e2
+ .quad 0X3ff07989af9f9f59
+ .quad 0X3ff07c8a2611201c
+ .quad 0X3ff08291a9958f03
+ .quad 0X3ff088a208c3fe28
+ .quad 0X3ff08bad91dd7d8b
+ .quad 0X3ff091cb6588465e
+ .quad 0X3ff097f24eab04a1
+ .quad 0X3ff09b0932aee3f2
+ .quad 0X3ff0a13de8970de4
+ .quad 0X3ff0a45bc08a5ac7
+ .quad 0X3ff0aa9e79bfa986
+ .quad 0X3ff0b0eaa961ca5b
+ .quad 0X3ff0b4145573271c
+ .quad 0X3ff0ba6ee5f9aad4
+ .quad 0X3ff0bd9fd0dbe02d
+ .quad 0X3ff0c408fc1cfd4b
+ .quad 0X3ff0c741430e2059
+ .quad 0X3ff0cdb9442ea813
+ .quad 0X3ff0d0f905168e6c
+ .quad 0X3ff0d7801893d261
+ .quad 0X3ff0dac772091bde
+ .quad 0X3ff0e15dd5c330ab
+ .quad 0X3ff0e4ace71080a4
+ .quad 0X3ff0e7fe920f3037
+ .quad 0X3ff0eea9c37e497e
+ .quad 0X3ff0f203512f4314
+ .quad 0X3ff0f8be68db7f32
+ .quad 0X3ff0fc1ffa42d902
+ .quad 0X3ff102eb3af9ed89
+ .quad 0X3ff10654f1e29cfb
+ .quad 0X3ff109c1679c189f
+ .quad 0X3ff110a29f080b3d
+ .quad 0X3ff114176891738a
+ .quad 0X3ff1178f0099b429
+ .quad 0X3ff11e86ac2cd7ab
+ .quad 0X3ff12206c7cf4046
+ .quad 0X3ff12589c21fb842
+ .quad 0X3ff12c986355d0d2
+ .quad 0X3ff13024129645cf
+ .quad 0X3ff133b2b13aa0eb
+ .quad 0X3ff13ad8cdc48ba3
+ .quad 0X3ff13e70544b1d4f
+ .quad 0X3ff1420adb77c99a
+ .quad 0X3ff145a867b1bfea
+ .quad 0X3ff14ceca1189d6d
+ .quad 0X3ff15093574284e9
+ .quad 0X3ff1543d2473ea9b
+ .quad 0X3ff157ea0d433a46
+ .quad 0X3ff15f4d44462724
+ .quad 0X3ff163039bd7cde6
+ .quad 0X3ff166bd21c3a8e2
+ .quad 0X3ff16a79dad1fb59
+ .quad 0X3ff171fcf9aaac3d
+ .quad 0X3ff175c3693980c3
+ .quad 0X3ff1798d1f73f3ef
+ .quad 0X3ff17d5a2156e97f
+ .quad 0X3ff1812a73ea2593
+ .quad 0X3ff184fe1c406b8f
+ .quad 0X3ff18caf82b8dba4
+ .quad 0X3ff1908d4b38a510
+ .quad 0X3ff1946e7e36f7e5
+ .quad 0X3ff1985320ff72a2
+ .quad 0X3ff19c3b38e975a8
+ .quad 0X3ff1a026cb58453d
+ .quad 0X3ff1a415ddbb2c10
+ .quad 0X3ff1a808758d9e32
+ .quad 0X3ff1aff84bac98ea
+ .quad 0X3ff1b3f5952e1a50
+ .quad 0X3ff1b7f67a896220
+ .quad 0X3ff1bbfb0178d186
+ .quad 0X3ff1c0032fc3cf91
+ .quad 0X3ff1c40f0b3eefc4
+ .quad 0X3ff1c81e99cc193f
+ .quad 0X3ff1cc31e15aae72
+ .quad 0X3ff1d048e7e7b565
+ .quad 0X3ff1d463b37e0090
+ .quad 0X3ff1d8824a365852
+ .quad 0X3ff1dca4b237a4f7
+ .quad 0X3ff1e0caf1b71965
+ .quad 0X3ff1e4f50ef85e61
+ .quad 0X3ff1e923104dbe76
+ .quad 0X3ff1ed54fc185286
+ .quad 0X3ff1f18ad8c82efc
+ .quad 0X3ff1f5c4acdc91aa
+ .quad 0X3ff1fa027ee4105b
+ .quad 0X3ff1fe44557cc808
+ .quad 0X3ff2028a37548ccf
+ .quad 0X3ff206d42b291a95
+ .quad 0X3ff20b2237c8466a
+ .quad 0X3ff20f74641030a6
+ .quad 0X3ff213cab6ef77c7
+ .quad 0X3ff2182537656c13
+ .quad 0X3ff21c83ec824406
+ .quad 0X3ff220e6dd675180
+ .quad 0X3ff2254e114737d2
+ .quad 0X3ff229b98f66228c
+ .quad 0X3ff22e295f19fd31
+ .quad 0X3ff2329d87caabb6
+ .quad 0X3ff2371610f243f2
+ .quad 0X3ff23b93021d47da
+ .quad 0X3ff2401462eae0b8
+ .quad 0X3ff2449a3b0d1b3f
+ .quad 0X3ff2449a3b0d1b3f
+ .quad 0X3ff2492492492492
+ .quad 0X3ff24db370778844
+ .quad 0X3ff25246dd846f45
+ .quad 0X3ff256dee16fdfd4
+ .quad 0X3ff25b7b844dfe71
+ .quad 0X3ff2601cce474fd2
+ .quad 0X3ff264c2c798fbe5
+ .quad 0X3ff2696d789511e2
+ .quad 0X3ff2696d789511e2
+ .quad 0X3ff26e1ce9a2cd73
+ .quad 0X3ff272d1233edcf3
+ .quad 0X3ff2778a2dfba8d0
+ .quad 0X3ff27c4812819c13
+ .quad 0X3ff2810ad98f6e10
+ .quad 0X3ff285d28bfa6d45
+ .quad 0X3ff285d28bfa6d45
+ .quad 0X3ff28a9f32aecb79
+ .quad 0X3ff28f70d6afeb08
+ .quad 0X3ff294478118ad83
+ .quad 0X3ff299233b1bc38a
+ .quad 0X3ff299233b1bc38a
+ .quad 0X3ff29e040e03fdfb
+ .quad 0X3ff2a2ea0334a07b
+ .quad 0X3ff2a7d52429b556
+ .quad 0X3ff2acc57a7862c2
+ .quad 0X3ff2acc57a7862c2
+ .quad 0X3ff2b1bb0fcf4190
+ .quad 0X3ff2b6b5edf6b54a
+ .quad 0X3ff2bbb61ed145cf
+ .quad 0X3ff2c0bbac5bfa6e
+ .quad 0X3ff2c0bbac5bfa6e
+ .quad 0X3ff2c5c6a0aeb681
+ .quad 0X3ff2cad705fc97a6
+ .quad 0X3ff2cfece6945583
+ .quad 0X3ff2cfece6945583
+ .quad 0X3ff2d5084ce0a331
+ .quad 0X3ff2da294368924f
+ .quad 0X3ff2df4fd4cff7c3
+ .quad 0X3ff2df4fd4cff7c3
+ .quad 0X3ff2e47c0bd7d237
+ .quad 0X3ff2e9adf35eb25a
+ .quad 0X3ff2eee5966124e8
+ .quad 0X3ff2eee5966124e8
+ .quad 0X3ff2f422fffa1e92
+ .quad 0X3ff2f9663b6369b6
+ .quad 0X3ff2feaf53f61612
+ .quad 0X3ff2feaf53f61612
+ .quad 0X3ff303fe552aea57
+ .quad 0X3ff309534a9ad7ce
+ .quad 0X3ff309534a9ad7ce
+ .quad 0X3ff30eae3fff6ff3
+ .quad 0X3ff3140f41335c2f
+ .quad 0X3ff3140f41335c2f
+ .quad 0X3ff319765a32d7ae
+ .quad 0X3ff31ee3971c2b5b
+ .quad 0X3ff3245704302c13
+ .quad 0X3ff3245704302c13
+ .quad 0X3ff329d0add2bb20
+ .quad 0X3ff32f50a08b48f9
+ .quad 0X3ff32f50a08b48f9
+ .quad 0X3ff334d6e9055a5f
+ .quad 0X3ff33a6394110fe6
+ .quad 0X3ff33a6394110fe6
+ .quad 0X3ff33ff6aea3afed
+ .quad 0X3ff3459045d8331b
+ .quad 0X3ff3459045d8331b
+ .quad 0X3ff34b3066efd36b
+ .quad 0X3ff350d71f529dd8
+ .quad 0X3ff350d71f529dd8
+ .quad 0X3ff356847c9006b4
+ .quad 0X3ff35c388c5f80bf
+ .quad 0X3ff35c388c5f80bf
+ .quad 0X3ff361f35ca116ff
+ .quad 0X3ff361f35ca116ff
+ .quad 0X3ff367b4fb5e0985
+ .quad 0X3ff36d7d76c96d0a
+ .quad 0X3ff36d7d76c96d0a
+ .quad 0X3ff3734cdd40cd95
+ .quad 0X3ff379233d4cd42a
+ .quad 0X3ff379233d4cd42a
+ .quad 0X3ff37f00a5a1ef96
+ .quad 0X3ff37f00a5a1ef96
+ .quad 0X3ff384e52521006c
+ .quad 0X3ff38ad0cad80848
+ .quad 0X3ff38ad0cad80848
+ .quad 0X3ff390c3a602dc60
+ .quad 0X3ff390c3a602dc60
+ .quad 0X3ff396bdc60bdb88
+ .quad 0X3ff39cbf3a8ca7a9
+ .quad 0X3ff39cbf3a8ca7a9
+ .quad 0X3ff3a2c8134ee2d1
+ .quad 0X3ff3a2c8134ee2d1
+ .quad 0X3ff3a8d8604cefe3
+ .quad 0X3ff3aef031b2b706
+ .quad 0X3ff3aef031b2b706
+ .quad 0X3ff3b50f97de6de5
+ .quad 0X3ff3b50f97de6de5
+ .quad 0X3ff3bb36a36163d8
+ .quad 0X3ff3bb36a36163d8
+ .quad 0X3ff3c1656500d20a
+ .quad 0X3ff3c79bedb6afb8
+ .quad 0X3ff3c79bedb6afb8
+ .quad 0X3ff3cdda4eb28aa2
+ .quad 0X3ff3cdda4eb28aa2
+ .quad 0X3ff3d420995a63c0
+ .quad 0X3ff3d420995a63c0
+ .quad 0X3ff3da6edf4b9061
+ .quad 0X3ff3da6edf4b9061
+ .quad 0X3ff3e0c5325b9fc2
+ .quad 0X3ff3e723a499453f
+ .quad 0X3ff3e723a499453f
+ .quad 0X3ff3ed8a484d473a
+ .quad 0X3ff3ed8a484d473a
+ .quad 0X3ff3f3f92ffb72d8
+ .quad 0X3ff3f3f92ffb72d8
+ .quad 0X3ff3fa706e6394a4
+ .quad 0X3ff3fa706e6394a4
+ .quad 0X3ff400f01682764a
+ .quad 0X3ff400f01682764a
+ .quad 0X3ff407783b92e17a
+ .quad 0X3ff407783b92e17a
+ .quad 0X3ff40e08f10ea81a
+ .quad 0X3ff40e08f10ea81a
+ .quad 0X3ff414a24aafb1e6
+ .quad 0X3ff414a24aafb1e6
+ .quad 0X3ff41b445c710fa7
+ .quad 0X3ff41b445c710fa7
+ .quad 0X3ff421ef3a901411
+ .quad 0X3ff421ef3a901411
+ .quad 0X3ff428a2f98d728b
+
+
+
+
+
+
diff --git a/src/gas/copysign.S b/src/gas/copysign.S
new file mode 100644
index 0000000..d5b96cf
--- /dev/null
+++ b/src/gas/copysign.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#copysign.S
+#
+# An implementation of the copysign libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+#
+# Prototype:
+#
+#	double copysign(double x, double y)
+#
+#
+#
+# Algorithm:
+#
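+# (Sketch of the bit manipulation done below; bits64/from_bits64 are
+#  illustrative helpers, not functions in this library.)
+#
+#   result = from_bits64( (bits64(x) & 0x7fffffffffffffff)
+#                       | (bits64(y) & 0x8000000000000000) );
+#
+#  The code realizes the two masks with PSLLQ/PSRLQ shift pairs and combines
+#  them with POR.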
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysign)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ PSLLQ $1,%xmm0
+ PSRLQ $1,%xmm0
+ PSRLQ $63,%xmm1
+ PSLLQ $63,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
diff --git a/src/gas/copysignf.S b/src/gas/copysignf.S
new file mode 100644
index 0000000..90e63d6
--- /dev/null
+++ b/src/gas/copysignf.S
@@ -0,0 +1,70 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#copysignf.S
+#
+# An implementation of the copysignf libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+# Prototype:
+#
+#	float copysignf(float x, float y)
+#
+
+#
+# Algorithm:
+#
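+# (Sketch, analogous to copysign but on 32-bit floats; bits32/from_bits32 are
+#  illustrative helpers, not functions in this library.)
+#
+#   result = from_bits32( (bits32(x) & 0x7fffffff) | (bits32(y) & 0x80000000) );
+#
+#  The code below uses PSLLD/PSRLD shift pairs and POR for the same effect.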
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysignf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #PANDN .L__fabsf_and_mask, %xmm1
+ #POR %xmm1,%xmm0
+
+ PSLLD $1,%xmm0
+ PSRLD $1,%xmm0
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
+
+#.align 16
+#.L__sign_mask: .long 0x7FFFFFFF
+#	 .long 0x0
+#	 .quad 0x0
+
diff --git a/src/gas/cos.S b/src/gas/cos.S
new file mode 100644
index 0000000..dc227e0
--- /dev/null
+++ b/src/gas/cos.S
@@ -0,0 +1,485 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the cos function.
+#
+# Prototype:
+#
+# double cos(double x);
+#
+# Computes cos(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
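+# (Outline, reconstructed from the code below, of the main path:)
+#
+#   npi2   = (int)(x * 2/pi + 0.5);        /* nearest multiple of pi/2 */
+#   rhead  = x - npi2 * piby2_1;           /* pi/2 split across piby2_1,   */
+#   rtail  = npi2 * piby2_1tail;           /* piby2_1tail, piby2_2, ...    */
+#   r      = rhead - rtail;                /* (Cody-Waite style reduction) */
+#   region = npi2 & 3;
+#   cos(x) = +/- (sin or cos polynomial evaluated at r), chosen by region.
+#
+#  Very large arguments are reduced with __amd_remainder_piby2 instead, and
+#  tiny arguments short-circuit to 1.0 or 1.0 - x*x*0.5.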
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_411E848000000000:	.quad 0x415312d000000000	# 5e6 (label kept from the earlier value 0x0411E848000000000 = 5e5)
+ .quad 0
+.L__real_bfe0000000000000: .quad 0x0bfe0000000000000 # - 0.5
+ .quad 0
+
+.align 32
+.Lcosarray:
+ .quad 0x3fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0xbf56c16c16c16967 # -0.00138889 c2
+ .quad 0
+ .quad 0x3EFA01A019F4EC91 # 2.48016e-005 c3
+ .quad 0
+ .quad 0xbE927E4FA17F667B # -2.75573e-007 c4
+ .quad 0
+ .quad 0x3E21EEB690382EEC # 2.08761e-009 c5
+ .quad 0
+ .quad 0xbDA907DB47258AA7 # -1.13826e-011 c6
+ .quad 0
+
+.align 32
+.Lsinarray:
+ .quad 0xbfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x3f81111111110bb3 # 0.00833333 s2
+ .quad 0
+ .quad 0xbf2a01a019e83e5c # -0.000198413 s3
+ .quad 0
+ .quad 0x3ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0
+ .quad 0xbe5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0
+ .quad 0x3de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0
+
+.text
+.align 32
+.p2align 5,,31
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cos)
+#define fname_special _cos_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000, %rax
+ mov %rax, %r10
+ and %rdx, %r10
+ cmp %rax, %r10
+ jz .Lcos_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lcos_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lcos_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000, %rax
+ cmp %rax, %r10
+ jge .Lcos_smaller
+
+# cos = 1.0;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # return a 1
+ jmp .Lcos_cleanup
+
+## else
+.align 16
+.Lcos_smaller:
+# cos = 1.0 - x*x*0.5;
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 * x^2
+ subsd %xmm2, %xmm0
+ jmp .Lcos_cleanup
+
+## else
+
+.align 16
+.Lcos_small:
+# cos = cos_piby4(x, 0.0);
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+
+ movsd .Lcosarray+0x10(%rip), %xmm1 # c2
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lcosarray+0x30(%rip), %xmm3 # c4
+ mulsd %xmm2, %xmm1 # c2x2
+ movsd .Lcosarray+0x50(%rip), %xmm5 # c6
+ mulsd %xmm2, %xmm3 # c4x2
+ movsd %xmm4, %xmm0 # move for x8
+ mulsd %xmm2, %xmm5 # c6x2
+ mulsd %xmm4, %xmm0 # x8
+ addsd .Lcosarray(%rip), %xmm1 # c1 + c2x2
+ mulsd %xmm4, %xmm1 # c1x4 + c2x6
+ addsd .Lcosarray+0x20(%rip), %xmm3 # c3 + c4x2
+ mulsd .L__real_bfe0000000000000(%rip), %xmm2 # -0.5x2, destroy xmm2
+ addsd .Lcosarray+0x40(%rip), %xmm5 # c5 + c6x2
+ mulsd %xmm0, %xmm3 # c3x8 + c4x10
+ mulsd %xmm0, %xmm4 # x12
+ mulsd %xmm5, %xmm4 # c5x12 + c6x14
+
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1
+ addsd %xmm3, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10
+ movsd %xmm2, %xmm3 # preserve -0.5x2
+ addsd %xmm0, %xmm2 # t = 1 - 0.5x2
+ subsd %xmm2, %xmm0 # 1-t
+ addsd %xmm3, %xmm0 # (1-t) - r
+ addsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14
+ addsd %xmm1, %xmm0 # (1-t) - r + c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14
+ addsd %xmm2, %xmm0 # 1 - 0.5x2 + above
+
+ jmp .Lcos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcos_reduce:
+
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lcos_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+	cvtdq2pd %xmm0, %xmm2			# and back to double
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiffless15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiffless15:
+# region = npi2 & 3;
+
+ subsd %xmm0, %xmm4 # rhead-r
+ subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then the cos is ~1.0 to within 53 bits when r is < 2^-27.
+# We already have x at this point, so we can skip the cos polynomials.
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .Lcos_piby4 # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+ jle .Lr_small # then cos(r) = 1
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lsinsmall
+
+# region 1 or 3
+# use simply polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 16
+.Lsinsmall:
+# region 0 or 2
+# cos = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jnz .Ladjust_region
+
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
+
+.align 32
+.Lcos_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2(x, &r, &rr, ®ion);
+
+ lea region(%rsp), %rdx
+ lea rr(%rsp), %rsi
+ lea r(%rsp), %rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # x
+ movsd rr(%rsp), %xmm4 # xx
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+# perform taylor series to calc sinx, cosx
+.Lcos_piby4:
+# x2 = r * r;
+
+# xmm4 holds part of rr for the sin path but is overwritten in the cos path,
+# so xmm3 is used here instead; xmm3 itself is overwritten in the sin path.
+ movsd %xmm0, %xmm3
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lcospiby4
+
+# region 1 or 3
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm4,p_temp(%rsp) # store xx
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2
+ movsd p_temp(%rsp), %xmm0 # load xx
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ mulsd %xmm0, %xmm2 # 0.5 * x2 *xx
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx
+ addsd %xmm4, %xmm0 # +xx
+ addsd p_temp1(%rsp), %xmm0 # +x
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcospiby4:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+ mulsd %xmm0, %xmm4 # x*xx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5
+ movsd .Lcosarray+0x50(%rip), %xmm1 # c6
+ movsd .Lcosarray+0x20(%rip), %xmm0 # c3
+ mulsd %xmm2, %xmm5 # r = 0.5 *x2
+ movsd %xmm2, %xmm3 # copy of x2
+ movsd %xmm4,p_temp(%rsp) # store x*xx
+ mulsd %xmm2, %xmm1 # c6*x2
+ mulsd %xmm2, %xmm0 # c3*x2
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r
+ mulsd %xmm2, %xmm3 # x4
+ addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6
+ addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3
+ addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t
+ mulsd %xmm2, %xmm3 # x6
+ mulsd %xmm2, %xmm1 # x2(c5+x2c6)
+ mulsd %xmm2, %xmm0 # x2(c2+x2C3)
+ movsd %xmm2, %xmm4 # copy of x2
+ mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate
+ addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6)
+ addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3)
+ mulsd %xmm2, %xmm2 # x4 recalculate
+ subsd %xmm4, %xmm5 # (1 + (-t)) - r
+ mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6))
+ addsd %xmm1, %xmm0 # zc
+	subsd .L__real_3ff0000000000000(%rip), %xmm4	# t recalculate
+ subsd p_temp(%rsp), %xmm5 # ((1 + (-t)) - r) - x*xx
+ mulsd %xmm2, %xmm0 # x4 * zc
+ addsd %xmm5, %xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subsd %xmm4, %xmm0 # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region:		# positive or negative (0, 1, 2, 3) => (1, 2, 3, 4) => (0, 2, 2, 0)
+# switch (region)
+ add $1, %eax
+ and $2, %eax
+ jz .Lcos_cleanup
+## if the original region 1 or 2 then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lcos_cleanup:
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lcos_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
+
diff --git a/src/gas/cosf.S b/src/gas/cosf.S
new file mode 100644
index 0000000..43eae9a
--- /dev/null
+++ b/src/gas/cosf.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# An implementation of the cosf function.
+#
+# Prototype:
+#
+#     float cosf(float x);
+#
+# Computes cosf(x).
+# Based on the NAG C implementation.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
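+#
+# (The scheme matches cos.S: the input is widened to double with cvtss2sd,
+#  reduced with the same piby2_1/piby2_1tail split (or __amd_remainder_piby2
+#  for very large arguments), and a shorter sin/cos polynomial from .Lcsarray
+#  is evaluated; region = npi2 & 3 selects and signs the result.)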
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000:	.quad 0x415312d000000000	# 5e6 (label kept from the earlier value 0x0411E848000000000 = 5e5)
+ .quad 0
+
+.align 32
+.Lcsarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+
+.text
+.align 32
+.p2align 5,,31
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cosf)
+#define fname_special _cosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ region, 0x50 # pointer to region for amd_remainder_piby2
+.equ r, 0x60 # pointer to r for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+
+ sub $stack_size, %rsp
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lcosf_naninf
+
+ xorpd %xmm2, %xmm2
+ mov %rdx, %r11 # save 1st return value pointer
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
+ cvtss2sd %xmm0, %xmm0
+
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+
+ mov $1, %r8d # for determining region later on
+ movsd %xmm0, %xmm1 # copy x to xmm1
+
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .L__sc_reducec
+
+# *c = cos_piby4(x, 0.0);
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+ xor %eax, %eax
+ mov %r10, %rdx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ jmp .L__sc_piby4c
+
+.align 32
+.L__sc_reducec:
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lcosf_reduce_precise
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+# perform taylor series to calc sinx, cosx
+# xmm0=abs(x), xmm1=x
+.align 32
+.Lcosf_piby4:
+#/* How many pi/2 is x a multiple of? */
+# npi2 = (int)(x * twobypi + 0.5);
+
+ movsd %xmm0, %xmm2
+ movsd %xmm0, %xmm4
+
+ mulsd .L__real_3fe45f306dc9c883(%rip), %xmm2 # twobypi
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+#/* How many pi/2 is x a multiple of? */
+
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >> EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to double
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+
+ mulsd %xmm2, %xmm3 # use piby2_1
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1 # rtail
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ movd %xmm0, %rcx # rcx is rhead-rtail
+
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiffless15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiffless15:
+# region = npi2 & 3;
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 #x^2
+ movsd %xmm0, %xmm1
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .L__sc_piby4c # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+ jle .L__rc_small # then cos(r) = 1
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lsinsmall
+# region 1 or 3
+# use a simple polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm1, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm1 # xs
+ jmp .L__adjust_region_cos
+
+.align 16
+.Lsinsmall:
+# region 0 or 2
+# cos = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm1
+ jmp .L__adjust_region_cos
+
+.align 16
+.L__rc_small: # then sin(r) = r
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jnz .L__adjust_region_cos
+ movsd .L__real_3ff0000000000000(%rip), %xmm1 # cos(r) is a 1
+ jmp .L__adjust_region_cos
+
+
+# done with reducing the argument. Now perform the sin/cos calculations.
+.align 16
+.L__sc_piby4c:
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lcospiby4
+
+ movsd .Lcsarray+0x30(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm1 # x2c4
+ movsd .Lcsarray+0x10(%rip), %xmm3 # c2
+ mulsd %xmm4, %xmm4 # x4
+ mulsd %xmm2, %xmm3 # x2c2
+ mulsd %xmm0, %xmm2 # x3
+ addsd .Lcsarray+0x20(%rip), %xmm1 # c3 + x2c4
+ mulsd %xmm4, %xmm1 # x4(c3 + x2c4)
+ addsd .Lcsarray(%rip), %xmm3 # c1 + x2c2
+ addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6
+ mulsd %xmm2, %xmm1 # c1x3 + c2x5 + c3x7 + c4x9
+ addsd %xmm0, %xmm1 # x + c1x3 + c2x5 + c3x7 + c4x9
+
+ jmp .L__adjust_region_cos
+
+.align 16
+.Lcospiby4:
+# region 0 or 2 - do a cos calculation
+ movsd .Lcsarray+0x38(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm1 # x2c4
+ movsd .Lcsarray+0x18(%rip), %xmm3 # c2
+ mulsd %xmm4, %xmm4 # x4
+ mulsd %xmm2, %xmm3 # x2c2
+ mulsd %xmm2, %xmm5 # 0.5 * x2
+ addsd .Lcsarray+0x28(%rip), %xmm1 # c3 + x2c4
+ mulsd %xmm4, %xmm1 # x4(c3 + x2c4)
+ addsd .Lcsarray+8(%rip), %xmm3 # c1 + x2c2
+ addsd %xmm3, %xmm1 # c1 + x2c2 + c3x4 + c4x6
+ mulsd %xmm4, %xmm1 # x4(c1 + c2x2 + c3x4 + c4x6)
+
+# -t = rc-1;
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # 0.5x2 - 1
+ subsd %xmm5, %xmm1 # cos = 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10
+
+.L__adjust_region_cos:		# xmm1 is cos or sin; relies on previous sections to leave the region number in %eax
+# switch (region)
+ add $1, %eax
+ and $2, %eax
+ jz .L__cos_cleanup
+## if region 1 or 2 then we negate the result.
+ xorpd %xmm2, %xmm2
+ subsd %xmm1, %xmm2
+ movsd %xmm2, %xmm1
+
+.align 16
+.L__cos_cleanup:
+ cvtsd2ss %xmm1, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lcosf_reduce_precise:
+# /* Reduce abs(x) into range [-pi/4,pi/4] */
+#      __amd_remainder_piby2(ax, &r, &region);
+
+ mov %rdx,p_temp(%rsp) # save ux for use later
+ mov %r10,p_temp1(%rsp) # save ax for use later
+ movd %xmm0, %rdi
+ lea r(%rsp), %rsi
+ lea region(%rsp), %rdx
+ sub $0x020, %rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x020, %rsp
+ mov p_temp(%rsp), %rdx # restore ux for use later
+ mov p_temp1(%rsp), %r10 # restore ax for use later
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # r
+ mov region(%rsp), %eax # region
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x^2
+ movsd %xmm0, %xmm1
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+ jmp .L__sc_piby4c
+
+.align 32
+.Lcosf_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
diff --git a/src/gas/exp.S b/src/gas/exp.S
new file mode 100644
index 0000000..153e8a6
--- /dev/null
+++ b/src/gas/exp.S
@@ -0,0 +1,400 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# exp.S
+#
+# An implementation of the exp libm function.
+#
+# Prototype:
+#
+# double exp(double x);
+#
+
+#
+# Algorithm:
+#
+# e^x = 2^(x/ln(2)) = 2^(x*(64/ln(2))/64)
+#
+# x*(64/ln(2)) = n + f, |f| <= 0.5, n is integer
+# n = 64*m + j, 0 <= j < 64
+#
+# e^x = 2^((64*m + j + f)/64)
+# = (2^m) * (2^(j/64)) * 2^(f/64)
+# = (2^m) * (2^(j/64)) * e^(f*(ln(2)/64))
+#
+# f = x*(64/ln(2)) - n
+# r = f*(ln(2)/64) = x - n*(ln(2)/64)
+#
+# e^x = (2^m) * (2^(j/64)) * e^r
+#
+# (2^(j/64)) is precomputed
+#
+# e^r = 1 + r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^6)/6!
+# e^r = 1 + q
+#
+# q = r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^6)/6!
+#
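+# A hedged C-style sketch of the same scheme (illustration only; the table
+# names paraphrase the data section of this file):
+#
+#   n = (int)(x * 64.0/ln2);                 /* truncated, as cvttpd2dq below */
+#   j = n & 0x3f;   m = n >> 6;
+#   r = (x - n*log2_by_64_head) - n*log2_by_64_tail;
+#   q = r + r*r*(1/2 + r*(1/6 + r*(1/24 + r*(1/120 + r*(1/720)))));
+#   f1 = two_to_jby64_head[j];  f2 = two_to_jby64_tail[j];
+#   result = scale_by_2_to_m(f1*q + f2*q + f2 + f1, m);  /* exponent adjustment */
+#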
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp)
+#define fname_special _exp_special@PLT
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__denormal_tiny_threshold(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64/ln(2))
+ movapd %xmm0,%xmm1
+ mulsd .L__real_64_by_log2(%rip), %xmm1
+
+ # n = int( x * (64/ln(2)) )
+ cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n
+ movd %xmm2, %ecx
+ movapd %xmm1,%xmm2
+ # r1 = x - n * ln(2)/64 head
+ mulsd .L__log2_by_64_mhead(%rip),%xmm1
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+
+ # r2 = - n * ln(2)/64 tail
+ mulsd .L__log2_by_64_mtail(%rip),%xmm2
+ addsd %xmm1,%xmm0 #xmm0 = r1
+
+ # r1+r2
+ addsd %xmm0, %xmm2 #xmm2 = r
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
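+    # m is folded in by adding it to the result's exponent field (multiply by 2^m)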
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_zero:
+ ucomisd .L__min_exp_arg(%rip),%xmm0
+ jbe .L__return_zero
+ movapd .L__real_smallest_denormal(%rip), %xmm0
+ ret
+
+.p2align 4
+.L__return_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ jmp fname_special
+
+.data
+.align 16
+.L__max_exp_arg: .quad 0x40862e42fefa39ef
+.L__denormal_tiny_threshold: .quad 0xc0874046dfefd9d0
+.L__min_exp_arg: .quad 0xc0874910d52d3051
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+
+.align 16
+.L__log2_by_64_mhead: .quad 0xbf862e42fefa0000
+.L__log2_by_64_mtail: .quad 0xbd1cf79abc9e3b39
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+.L__real_smallest_denormal: .quad 0x0000000000000001
+
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+#endif
diff --git a/src/gas/exp10.S b/src/gas/exp10.S
new file mode 100644
index 0000000..009bbe0
--- /dev/null
+++ b/src/gas/exp10.S
@@ -0,0 +1,366 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp10)
+#define fname_special _exp10_special@PLT
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp10_arg(%rip), %xmm0
+ jae .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__min_exp10_arg(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64/log10(2))
+ movapd %xmm0,%xmm1
+ mulsd .L__real_64_by_log10of2(%rip), %xmm1
+
+ # n = int( x * (64/log10(2)) )
+ cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n
+ movd %xmm2, %ecx
+ movapd %xmm1,%xmm2
+ # r1 = x - n * log10(2)/64 head
+ mulsd .L__log10of2_by_64_mhead(%rip),%xmm1
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+ # r2 = - n * log10(2)/64 tail
+ mulsd .L__log10of2_by_64_mtail(%rip),%xmm2 #xmm2 = r2
+ addsd %xmm1,%xmm0 #xmm0 = r1
+
+ # r1 *= ln10;
+ # r2 *= ln10;
+ mulsd .L__ln10(%rip),%xmm0
+ mulsd .L__ln10(%rip),%xmm2
+
+ # r1+r2
+ addsd %xmm0, %xmm2 #xmm2 = r
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
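+# Hedged summary of the reduction above (illustration only):
+#   10^x = 2^(x*log2(10)) = 2^m * 2^(j/64) * e^r
+#   with n = (int)(x*64/log10(2)),  j = n & 0x3f,  m = n >> 6,
+#   and  r = (x - n*log10(2)/64) * ln(10).
+# 2^(j/64) comes from the head/tail tables below; 2^m is applied through the
+# exponent field, with the denormal path covering m <= -1022.
+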
+.data
+.align 16
+.L__max_exp10_arg: .quad 0x40734413509f79ff
+.L__min_exp10_arg: .quad 0xc07434e6420f4374
+.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2)
+.L__ln10: .quad 0x40026BB1BBB55516
+
+.align 16
+.L__log10of2_by_64_mhead: .quad 0xbF73441350000000
+.L__log10of2_by_64_mtail: .quad 0xbda3ef3fde623e25
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
+
diff --git a/src/gas/exp10f.S b/src/gas/exp10f.S
new file mode 100644
index 0000000..da805e2
--- /dev/null
+++ b/src/gas/exp10f.S
@@ -0,0 +1,191 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp10f)
+#define fname_special _exp10f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64/log10of(2))
+       movapd      %xmm0,%xmm3                                #xmm3 = (double)x
+       mulsd       .L__real_64_by_log10of2(%rip), %xmm3       #xmm3 = x * (64/log10(2))
+
+ # n = int( x * (64/log10of(2)) )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * ln(2)/64
+ # r *= ln(10)
+ mulsd .L__real_log10of2_by_64(%rip),%xmm2 #xmm2 = n * log10of(2)/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ mulsd .L__real_ln10(%rip),%xmm0 #xmm0 = r = r*ln10
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+       # q = r + r*r*(1/2 + r*1/6)
+ movapd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+
+ # f + (f*q)
+ lea L__two_to_jby64_table(%rip), %r10
+ mulsd (%r10,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ .p2align 4
+ # m = (n - j) / 64
+ psrad $6,%xmm4
+ psllq $52,%xmm4
+ paddq %xmm0, %xmm4
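+    # (the shifted m lands in the exponent field of the packed double,
+    #  scaling the result by 2^m before the final convert to float)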
+ cvtpd2ps %xmm4, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ #call fname_special
+ pxor %xmm0,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if calling fname special
+ ret
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x421A209B
+.L__min_exp_arg: .long 0xC23369F4
+.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2)
+.L__real_log10of2_by_64: .quad 0x3F734413509F79FF # log10of2_by_64
+.L__real_ln10: .quad 0x40026BB1BBB55516 # ln(10)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
diff --git a/src/gas/exp2.S b/src/gas/exp2.S
new file mode 100644
index 0000000..8e556d4
--- /dev/null
+++ b/src/gas/exp2.S
@@ -0,0 +1,355 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp2)
+#define fname_special _exp2_special@PLT
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp2_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__min_exp2_arg(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64)
+ movapd %xmm0,%xmm2
+ mulsd .L__real_64(%rip), %xmm2
+
+ # n = int( x * (64))
+ cvttpd2dq %xmm2, %xmm1 #xmm1 = (int)n
+ cvtdq2pd %xmm1, %xmm2 #xmm2 = (double)n
+ movd %xmm1, %ecx
+
+ # r = x - n * 1/64
+ #r *= ln2;
+ mulsd .L__one_by_64(%rip),%xmm2
+ addsd %xmm0,%xmm2 #xmm2 = r
+ mulsd .L__ln_2(%rip),%xmm2
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.data
+.align 16
+.L__max_exp2_arg: .quad 0x4090000000000000
+.L__min_exp2_arg: .quad 0xc090c80000000000
+.L__real_64: .quad 0x4050000000000000 # 64
+.L__ln_2: .quad 0x3FE62E42FEFA39EF
+.L__one_by_64:			.quad 0xbF90000000000000	# -1/64 (stored negated; the mul-then-add computes r = x - n/64)
+
+.align 16
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
diff --git a/src/gas/exp2f.S b/src/gas/exp2f.S
new file mode 100644
index 0000000..78c50e0
--- /dev/null
+++ b/src/gas/exp2f.S
@@ -0,0 +1,193 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp2f)
+#define fname_special _exp2f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp2_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp2_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64)
+ movapd %xmm0,%xmm3 #xmm3 = (double)x
+ #mulsd .L__sixtyfour(%rip), %xmm3 #xmm3 = x * (64)
+ paddq .L__sixtyfour(%rip), %xmm3 #xmm3 = x * (64)
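+    # (adding 0x0060000000000000 bumps the biased exponent by 6, i.e. scales a
+    #  normal double by 2^6 = 64; only the integer part of the product is used)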
+
+       # n = int( x * 64 )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * 1/64
+ # r *= ln(2)
+ mulsd .L__one_by_64(%rip),%xmm2 #xmm2 = n * 1/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ mulsd .L__ln2(%rip),%xmm0 #xmm0 = r = r*ln(2)
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+ # q
+ movsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+
+ # f + (f*q)
+ lea L__two_to_jby64_table(%rip), %r10
+ mulsd (%r10,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ .p2align 4
+ # m = (n - j) / 64
+ psrad $6,%xmm4
+ psllq $52,%xmm4
+ paddq %xmm0, %xmm4
+ cvtpd2ps %xmm4, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ #call fname_special
+ pxor %xmm0,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if calling fname special
+ ret
+
+.data
+.align 16
+.L__max_exp2_arg: .long 0x43000000
+.L__min_exp2_arg: .long 0xc3150000
+.align 16
+.L__sixtyfour:			.quad 0x0060000000000000	# 6 << 52 (exponent-field increment used by paddq to scale by 64)
+.L__one_by_64: .quad 0x3F90000000000000 # 1/64
+.L__ln2: .quad 0x3FE62E42FEFA39EF # ln(2)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
diff --git a/src/gas/expf.S b/src/gas/expf.S
new file mode 100644
index 0000000..cefa608
--- /dev/null
+++ b/src/gas/expf.S
@@ -0,0 +1,201 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# expf.S
+#
+# An implementation of the expf libm function.
+#
+# Prototype:
+#
+# float expf(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in exp.S
+#
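+# (Hedged note: the single-precision path below uses one 2^(j/64) table and a
+#  degree-3 polynomial q = r + r^2/2 + r^3/6, which is enough for float accuracy.)
+#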
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expf)
+#define fname_special _expf_special@PLT
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64/ln(2))
+       movapd      %xmm0,%xmm3                                #xmm3 = (double)x
+       mulsd       .L__real_64_by_log2(%rip), %xmm3           #xmm3 = x * (64/ln(2))
+
+ # n = int( x * (64/ln(2)) )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * ln(2)/64
+ mulsd .L__real_log2_by_64(%rip),%xmm2 #xmm2 = n * ln(2)/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+ # q
+ movsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+ shl $52, %rcx
+
+ # (f)*(1+q)
+ lea L__two_to_jby64_table(%rip), %r10
+ movsd (%r10,%rax,8), %xmm2
+ mulsd %xmm2, %xmm0
+ addsd %xmm2, %xmm0
+
+ movd %rcx, %xmm1
+ paddq %xmm0, %xmm1
+ cvtpd2ps %xmm1, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_inf:
+
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ jmp fname_special
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x42B17218
+.L__min_exp_arg: .long 0xC2CE8ED0
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+.L__real_log2_by_64: .quad 0x3f862e42fefa39ef # log2_by_64
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
+#endif
diff --git a/src/gas/expm1.S b/src/gas/expm1.S
new file mode 100644
index 0000000..dff043c
--- /dev/null
+++ b/src/gas/expm1.S
@@ -0,0 +1,359 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expm1)
+
+#ifdef __ELF__
+ .section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .p2align 4
+.globl fname
+ .type fname, @function
+
+fname:
+
+ ucomisd .L__max_expm1_arg(%rip),%xmm0 #check if(x > 709.8)
+ ja .L__Max_Arg
+ jp .L__Max_Arg
+ ucomisd .L__min_expm1_arg(%rip),%xmm0 #if(x < -37.42994775023704)
+ jb .L__Min_Arg
+ ucomisd .L__log_OneMinus_OneByFour(%rip),%xmm0
+ jbe .L__Normal_Flow
+ ucomisd .L__log_OnePlus_OneByFour(%rip),%xmm0
+ jb .L__Small_Arg
+
+ .p2align 4
+.L__Normal_Flow:
+ movapd %xmm0,%xmm1 #xmm1 = x
+ mulsd .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2
+ ucomisd .L__zero(%rip),%xmm1 #check if temp < 0.0
+ jae .L__Add_Point_Five
+ subsd .L__point_Five(%rip),%xmm1
+ jmp .L__next
+.L__Add_Point_Five:
+ addsd .L__point_Five(%rip),%xmm1 #xmm1 = temp +/- 0.5
+.L__next:
+ cvttpd2dq %xmm1,%xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2,%xmm1 #xmm1 = (double)n
+ movapd %xmm2,%xmm3 #xmm3 = (int)n
+ psrad $5,%xmm2 #xmm2 = m
+ pslld $27,%xmm3
+ psrld $27,%xmm3 #xmm3 = j
+ movd %xmm3,%edx #edx = j
+ movd %xmm2,%ecx #ecx = m
+
+ movlhps %xmm1,%xmm1 #xmm1 = n,n
+ mulpd .L__Ln2By32_MinusTrailLead(%rip),%xmm1
+ movapd %xmm0,%xmm2
+ subsd %xmm1,%xmm2 #xmm2 = r1
+ psrldq $8,%xmm1 #xmm1 = r2
+ movapd %xmm2,%xmm3 #xmm3 = r1
+ addsd %xmm1,%xmm3 #xmm3 = r
+ #q = r*(r*(A1.f64 + r*(A2.f64 + r*(A3.f64 + r*(A4.f64 + r*(A5.f64))))));
+ movapd %xmm3,%xmm4
+ mulsd .L__A5(%rip),%xmm4
+ addsd .L__A4(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A3(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A2(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A1(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ mulsd %xmm4,%xmm3 #xmm3 = q
+
+ shl $4,%edx
+ lea S_lead_and_trail_table(%rip),%rax
+ movdqa (%rax,%rdx,1),%xmm5 #xmm5 = S_T,S_L
+
+ #p = (r2+q) + r1;
+ addsd %xmm3,%xmm1
+ addsd %xmm1,%xmm2 #xmm2 = p
+
+ #s = S_L.f64 + S_T.f64;
+ movhlps %xmm5,%xmm4 #xmm4 = S_T
+ movapd %xmm4,%xmm3 #xmm3 = S_T
+ addsd %xmm5,%xmm3 #xmm3 = s
+
+ cmp $52,%ecx #check m > 52
+ jg .L__M_Above_52
+ cmp $-7,%ecx #check if m < -7
+ jl .L__M_Below_Minus7
+ #(-8 < m) && (m < 53)
+ movapd %xmm2,%xmm3 #xmm3 = p
+ addsd .L__One(%rip),%xmm3 #xmm3 = 1+p
+ mulsd %xmm4,%xmm3 #xmm3 = S_T.f64 *(1+p)
+ mulsd %xmm5,%xmm2 #xmm2 = S_L*p
+ addsd %xmm3,%xmm2 #xmm2 = (S_L.f64*p+ S_T.f64 *(1+p))
+ mov $1023,%edx
+ sub %ecx,%edx #edx = twopmm
+ shl $52,%rdx
+ movd %rdx,%xmm1 #xmm1 = twopmm
+ subsd %xmm1,%xmm5 #xmm5 = S_L.f64 - twopmm.f64
+ addsd %xmm5,%xmm2
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2)
+ ret
+
+ .p2align 4
+.L__M_Above_52:
+ cmp $1024,%ecx #check if m = 1024
+ je .L__M_Equals_1024
+ #twopm.f64 * (S_L.f64 + (s*p+(S_T.f64 - twopmm.f64)));// 2^-m should not be calculated if m>105
+ mov $1023,%edx
+ sub %ecx,%edx #edx = twopmm
+ shl $52,%rdx
+ movd %rdx,%xmm1 #xmm1 = twopmm
+ subsd %xmm1,%xmm4 #xmm4 = S_T - twopmm
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2
+ addsd %xmm5,%xmm2
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Below_Minus7:
+ #twopm.f64 * (S_L.f64 + (s*p + S_T.f64)) - 1;
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2 #xmm2 = (s*p + S_T.f64)
+ addsd %xmm5,%xmm2 #xmm2 = (S_L.f64 + (s*p + S_T.f64))
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2)
+ subsd .L__One(%rip),%xmm0
+ ret
+
+ .p2align 4
+.L__M_Equals_1024:
+ mov $0x4000000000000000,%rax #1024 at exponent
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2 #xmm2 = (s*p) + S_T
+ addsd %xmm5,%xmm2 #xmm2 = S_L + ((s*p) + S_T)
+ movd %rax,%xmm1 #xmm1 = twopm
+ paddq %xmm2,%xmm1
+ movd %xmm1,%rax
+ mov $0x7FF0000000000000,%rcx
+ and %rcx,%rax
+ cmp %rcx,%rax #check if we reached inf
+ je .L__return_Inf
+ movapd %xmm1,%xmm0
+ ret
+
+ .p2align 4
+.L__Small_Arg:
+ movapd %xmm0,%xmm1
+ psllq $1,%xmm1
+ psrlq $1,%xmm1 #xmm1 = abs(x)
+ ucomisd .L__Five_Pont_FiveEMinus17(%rip),%xmm1
+ jb .L__VeryTinyArg
+ mov $0x01E0000000000000,%rax #30 in exponents place
+ #u = (twop30.f64 * x + x) - twop30.f64 * x;
+ movd %rax,%xmm1
+ paddq %xmm0,%xmm1 #xmm1 = twop30.f64 * x
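+    # (adding 30 to the exponent field scales x by 2^30; x is a normal double
+    #  here since the tiny-argument case was branched off above)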
+ movapd %xmm1,%xmm2
+ addsd %xmm0,%xmm2 #xmm2 = (twop30.f64 * x + x)
+ subsd %xmm1,%xmm2 #xmm2 = u
+ movapd %xmm0,%xmm1
+ subsd %xmm2,%xmm1 #xmm1 = v = x-u
+ movapd %xmm2,%xmm3 #xmm3 = u
+ mulsd %xmm2,%xmm3 #xmm3 = u*u
+ mulsd .L__point_Five(%rip),%xmm3 #xmm3 = y = u*u*0.5
+ #z = v * (x + u) * 0.5;
+ movapd %xmm0,%xmm4
+ addsd %xmm2,%xmm4
+ mulsd %xmm1,%xmm4
+ mulsd .L__point_Five(%rip),%xmm4 #xmm4 = z
+
+       #q = x*x*x*(B1.f64 + x*(B2.f64 + x*(B3.f64 + x*(B4.f64 + x*(B5.f64 + x*(B6.f64 + x*(B7.f64 + x*(B8.f64 + x*(B9.f64)))))))));
+ movapd %xmm0,%xmm5
+ mulsd .L__B9(%rip),%xmm5
+ addsd .L__B8(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B7(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B6(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B5(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B4(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B3(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B2(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B1(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ mulsd %xmm0,%xmm5
+ mulsd %xmm0,%xmm5 #xmm5 = q
+
+ ucomisd .L__TwopM7(%rip),%xmm3
+ jb .L__returnNext
+ addsd %xmm4,%xmm1 #xmm1 = v+z
+ addsd %xmm5,%xmm1 #xmm1 = q+(v+z)
+ addsd %xmm3,%xmm2 #xmm2 = u+y
+ addsd %xmm2,%xmm1
+ movapd %xmm1,%xmm0
+ ret
+ .p2align 4
+.L__returnNext:
+ addsd %xmm5,%xmm4 #xmm4 = q +z
+ addsd %xmm4,%xmm3 #xmm3 = y+(q+z)
+ addsd %xmm3,%xmm0
+ ret
+
+ .p2align 4
+.L__VeryTinyArg:
+        #(twop100.f64 * x + xabs.f64) * twopm100.f64;
+ mov $0x0640000000000000,%rax #100 at exponent's place
+ movd %rax,%xmm2
+ paddq %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ psubq %xmm2,%xmm0
+ ret
+
+
+ .p2align 4
+.L__Max_Arg:
+ movd %xmm0,%rcx
+ mov $0x7ff0000000000000,%rax
+ cmp %rax,%rcx #x is either Nan or Inf
+ jb .L__return_Inf
+ mov $0x000fffffffffffff,%rdx #check if x is Nan
+ and %rdx,%rcx
+ jne .L__Nan
+.L__return_Inf:
+ movd %rax,%xmm0
+ #call error_handler
+ ret
+ .p2align 4
+.L__Nan:
+ addsd %xmm0,%xmm0
+ ret
+
+ .p2align 4
+.L__Min_Arg:
+ mov $0xBFF0000000000000,%rax #return -1
+ #call error handler
+ movd %rax,%xmm0
+ ret
+
+.data
+.align 16
+.L__max_expm1_arg:
+ .quad 0x40862E6666666666
+.L__min_expm1_arg:
+ .quad 0xC042B708872320E1
+.L__log_OneMinus_OneByFour:
+ .quad 0xBFD269621134DB93
+.L__log_OnePlus_OneByFour:
+ .quad 0x3FCC8FF7C79A9A22
+.L__thirtyTwo_by_ln2:
+ .quad 0x40471547652B82FE
+.L__zero:
+ .quad 0x0000000000000000
+.L__point_Five:
+ .quad 0x3FE0000000000000
+
+.align 16
+.L__Ln2By32_MinusTrailLead:
+ .octa 0xBD8473DE6AF278ED3F962E42FEF00000
+.L__A5:
+ .quad 0x3F56C1728D739765
+.L__A4:
+ .quad 0x3F811115B7AA905E
+.L__A3:
+ .quad 0x3FA5555555545D4E
+.L__A2:
+ .quad 0x3FC5555555548F7C
+.L__A1:
+ .quad 0x3FE0000000000000
+.L__One:
+ .quad 0x3FF0000000000000
+
+.align 16
+# .type two_to_jby32_table, @object
+# .size two_to_jby32_table, 512
+S_lead_and_trail_table:
+ .octa 0x00000000000000003FF0000000000000
+ .octa 0x3D0A1D73E2A475B43FF059B0D3158540
+ .octa 0x3CEEC5317256E3083FF0B5586CF98900
+ .octa 0x3CF0A4EBBF1AED933FF11301D0125B40
+ .octa 0x3D0D6E6FBE4628763FF172B83C7D5140
+ .octa 0x3D053C02DC0144C83FF1D4873168B980
+ .octa 0x3D0C3360FD6D8E0B3FF2387A6E756200
+ .octa 0x3D009612E8AFAD123FF29E9DF51FDEC0
+ .octa 0x3CF52DE8D5A463063FF306FE0A31B700
+ .octa 0x3CE54E28AA05E8A93FF371A7373AA9C0
+ .octa 0x3D011ADA0911F09F3FF3DEA64C123400
+ .octa 0x3D068189B7A04EF83FF44E0860618900
+ .octa 0x3D038EA1CBD7F6213FF4BFDAD5362A00
+ .octa 0x3CBDF0A83C49D86A3FF5342B569D4F80
+ .octa 0x3D04AC64980A8C8F3FF5AB07DD485400
+ .octa 0x3CD2C7C3E81BF4B73FF6247EB03A5580
+ .octa 0x3CE921165F626CDD3FF6A09E667F3BC0
+ .octa 0x3D09EE91B87977853FF71F75E8EC5F40
+ .octa 0x3CDB5F54408FDB373FF7A11473EB0180
+ .octa 0x3CF28ACF88AFAB353FF82589994CCE00
+ .octa 0x3CFB5BA7C55A192D3FF8ACE5422AA0C0
+ .octa 0x3D027A280E1F92A03FF93737B0CDC5C0
+ .octa 0x3CF01C7C46B071F33FF9C49182A3F080
+ .octa 0x3CFC8B424491CAF83FFA5503B23E2540
+ .octa 0x3D06AF439A68BB993FFAE89F995AD380
+ .octa 0x3CDBAA9EC206AD4F3FFB7F76F2FB5E40
+ .octa 0x3CFC2220CB12A0923FFC199BDD855280
+ .octa 0x3D048A81E5E8F4A53FFCB720DCEF9040
+ .octa 0x3CDC976816BAD9B83FFD5818DCFBA480
+ .octa 0x3CFEB968CAC39ED33FFDFC97337B9B40
+ .octa 0x3CF9858F73A18F5E3FFEA4AFA2A490C0
+ .octa 0x3C99D3E12DD8A18B3FFF50765B6E4540
+
+.align 16
+.L__Five_Pont_FiveEMinus17:
+ .quad 0x3C90000000000000
+.L__B9:
+ .quad 0x3E5A2836AA646B96
+.L__B8:
+ .quad 0x3E928295484734EA
+.L__B7:
+ .quad 0x3EC71E14BFE3DB59
+.L__B6:
+ .quad 0x3EFA019F635825C4
+.L__B5:
+ .quad 0x3F2A01A01159DD2D
+.L__B4:
+ .quad 0x3F56C16C16CE14C6
+.L__B3:
+ .quad 0x3F8111111111A9F3
+.L__B2:
+ .quad 0x3FA55555555554B6
+.L__B1:
+ .quad 0x3FC5555555555549
+.L__TwopM7:
+ .quad 0x3F80000000000000
diff --git a/src/gas/expm1f.S b/src/gas/expm1f.S
new file mode 100644
index 0000000..6e7ca03
--- /dev/null
+++ b/src/gas/expm1f.S
@@ -0,0 +1,323 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expm1f)
+#define fname_special _expm1f_special@PLT
+
+#ifdef __ELF__
+ .section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .p2align 4
+.globl fname
+ .type fname, @function
+
+fname:
+ ucomiss .L__max_expm1_arg(%rip),%xmm0 ##if(x > max_expm1_arg)
+ ja .L__Max_Arg
+ jp .L__Max_Arg
+ ucomiss .L__log_OnePlus_OneByFour(%rip),%xmm0 ##if(x < log_OnePlus_OneByFour)
+ jae .L__Normal_Flow
+ ucomiss .L__log_OneMinus_OneByFour(%rip),%xmm0 ##if(x > log_OneMinus_OneByFour)
+ ja .L__Small_Arg
+ ucomiss .L__min_expm1_arg(%rip),%xmm0 ##if(x < min_expm1_arg)
+ jb .L__Min_Arg
+
+ .p2align 4
+.L__Normal_Flow:
+ movaps %xmm0,%xmm1 #xmm1 = x
+ mulss .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2
+ movd %xmm1,%eax #eax = x*thirtyTwo_by_ln2
+ and $0x80000000,%eax #get the sign of x*thirtyTwo_by_ln2
+ or $0x3F000000,%eax #make +/- 0.5
+ movd %eax,%xmm2 #xmm2 = +/- 0.5
+ addss %xmm2,%xmm1 #xmm1 = (x*32/ln2) +/- 0.5
+ cvttps2dq %xmm1,%xmm2 #xmm2 = n = (int)(temp)
+ mov $0x0000001f,%edx
+ movd %edx,%xmm1
+ andps %xmm2,%xmm1 #xmm1 = j
+ movd %xmm2,%ecx #ecx = n
+ sarl $5, %ecx #ecx = m = n >> 5
+ #xor %rdx,%rdx #make it zeros, to be used for address
+ movd %xmm1,%edx #edx = j
+ lea S_lead_and_trail_table(%rip),%rax
+ movsd (%rax,%rdx,8),%xmm3 #xmm3 = S_T,S_L
+ punpckldq %xmm2,%xmm1 #xmm1 = n,j
+ psubd %xmm1,%xmm2 #xmm2 = n1
+ punpcklqdq %xmm2,%xmm1 #xmm1 = n1,n,j
+ cvtdq2ps %xmm1,%xmm1 #xmm1 = (float)(n1,n,j)
+
+ #r2 = -(n*ln2_by_ThirtyTwo_trail);
+ #r1 = (x-n1*ln2_by_ThirtyTwo_lead) - j*ln2_by_ThirtyTwo_lead;
+ mulps .L__Ln2By32_LeadTrailLead(%rip),%xmm1
+ movhlps %xmm1,%xmm2 #xmm2 = n1*ln2/32lead
+ movaps %xmm0,%xmm4 #xmm4 = x
+ subss %xmm2,%xmm4 #xmm4 = x - n1*ln2/32lead
+ subss %xmm1,%xmm4 #xmm4 = r1
+ psrldq $4,%xmm1 #xmm1 = -r2 should take care of sign later
+
+ #r = r1 + r2;
+ movaps %xmm4,%xmm7 #xmm7 = r1
+ subss %xmm1,%xmm4 #xmm4 = r = r1-(-r2) = r1 + r2
+
+ #q = r*r*(B1+r*(B2));
+ movaps %xmm4,%xmm6 #xmm6 = r
+ mulss .L__B2_f(%rip),%xmm6 #xmm6 = r * B2
+ addss .L__B1_f(%rip),%xmm6 #xmm6 = B1 + (r * B2)
+ mulss %xmm4,%xmm6
+ mulss %xmm4,%xmm6 #xmm6 = q
+
+ #p = (r2+q) + r1;
+ subss %xmm1,%xmm6
+ addss %xmm7,%xmm6 #xmm6 = p
+
+ #s = S_L.f32 + S_T.f32;
+ movdqa %xmm3,%xmm2 #xmm2 = S_T,S_L
+ psrldq $4,%xmm2 #xmm2 = S_T
+ movaps %xmm2,%xmm5 #xmm5 = S_T
+ addss %xmm3,%xmm2 #xmm2 = s
+
+ cmp $0xfffffff9,%ecx #Check m < -7
+ jl .L__M_Below_Minus7
+ cmp $23,%ecx #Check m > 23
+ jg .L__M_Above_23
+ # -8 < m < 24
+ #twopm.f32 * ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p)));
+ movaps %xmm3,%xmm2 #xmm2 = S_L
+ mulss %xmm6,%xmm2 #xmm2 = S_L * p
+ addss .L__One_f(%rip),%xmm6 #xmm6 = 1+p
+ mulss %xmm5,%xmm6 #xmm6 = S_T *(1+p)
+ addss %xmm6,%xmm2 #xmm2 = (S_L.f32*p+ S_T.f32 *(1+p))
+ mov $127,%eax
+ sub %ecx,%eax #eax = 127 - m
+ shl $23,%eax #eax = 2^-m
+ movd %eax,%xmm1
+ subss %xmm1,%xmm3 #xmm3 = (S_L.f32 - twopmm.f32)
+ addss %xmm3,%xmm2 #xmm2 = ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p)))
+ shl $23,%ecx
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Below_Minus7:
+ #twopm.f32 * (S_L.f32 + (s*p + S_T.f32)) - 1;
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2 #xmm2 = s*p + S_T
+ addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32))
+ shl $23,%ecx
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ subss .L__One_f(%rip),%xmm0
+ ret
+
+ .p2align 4
+.L__M_Above_23:
+ #twopm.f32 * (S_L.f32 + (s*p+(S_T.f32 - twopmm.f32)));
+        cmp     $0x00000080,%ecx        #Check if m == 128
+ je .L__M_Equals_128
+ cmp $47,%ecx #Check m > 47
+ ja .L__M_Above_47
+ mov $127,%eax
+ sub %ecx,%eax #eax = 127 - m
+ shl $23,%eax #eax = 2^-m
+ movd %eax,%xmm1
+ subss %xmm1,%xmm5 #xmm5 = S_T.f32 - twopmm.f32
+
+ .p2align 4
+.L__M_Above_47:
+ shl $23,%ecx
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2
+ addss %xmm3,%xmm2
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Equals_128:
+ mov $0x3f800000,%ecx #127 at exponent
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2 #xmm2 = s*p + S_T
+ addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32))
+ movd %ecx,%xmm1 #127
+ paddd %xmm2,%xmm1 #2^127*(S_L.f32 + (s*p + S_T.f32))
+ mov $0x00800000,%ecx #multiply with one more 2
+ movd %ecx,%xmm2
+ paddd %xmm2,%xmm1
+ movd %xmm1,%ecx
+ and $0x7f800000,%ecx #check if we reached +inf
+ cmp $0x7f800000,%ecx
+ je .L__Overflow
+ movdqa %xmm1,%xmm0
+ ret
+
+ .p2align 4
+.L__Small_Arg:
+ movd %xmm0,%eax
+ and $0x7fffffff,%eax #eax = abs(x)
+ cmp $0x33000000,%eax #check abs(x) < 2^-25
+ jl .L__VeryTiny_Arg
+ #log(1-1/4) < x < log(1+1/4)
+ #q = x*x*x*(A1 + x*(A2 + x*(A3 + x*(A4 + x*(A5)))));
+ movdqa %xmm0,%xmm1
+ mulss .L__A5_f(%rip),%xmm1
+ addss .L__A4_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A3_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A2_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A1_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ mulss %xmm0,%xmm1
+ mulss %xmm0,%xmm1
+ cvtps2pd %xmm0,%xmm2
+ movdqa %xmm2,%xmm0
+ mulsd %xmm0,%xmm2
+ mulsd .L__PointFive(%rip),%xmm2
+ addsd %xmm2,%xmm0
+ cvtps2pd %xmm1,%xmm2
+ addsd %xmm0,%xmm2
+ cvtpd2ps %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__Min_Arg:
+ mov $0xBF800000,%eax
+ #call handle_error
+ movd %eax,%xmm0
+ ret
+
+ .p2align 4
+.L__Max_Arg:
+ movd %xmm0,%eax
+ and $0x7fffffff,%eax #eax = abs(x)
+ cmp $0x7f800000,%eax #check for Nan
+ jae .L__Nan
+.L__Overflow:
+ mov $0x7f800000,%eax
+ #call handle_error
+ movd %eax,%xmm0
+ ret
+.L__Nan:
+ and $0x007fffff,%eax
+ je .L__Overflow
+ addss %xmm0,%xmm0
+ ret
+
+ .p2align 4
+.L__VeryTiny_Arg:
+ #((twopm.f32 * x + xabs.f32) * twopmm.f32);
+ movd %eax, %xmm1 #xmm1 = abs(x)
+ mov $0x32000000, %eax #100 at exponent's place
+ movd %eax, %xmm2
+ paddd %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ psubd %xmm2, %xmm0
+ ret
+
+.data
+.align 16
+.type S_lead_and_trail_table, @object
+.size S_lead_and_trail_table, 256
+S_lead_and_trail_table:
+ .quad 0x000000003F800000
+ .quad 0x355315853F82CD80
+ .quad 0x34D9F3123F85AAC0
+ .quad 0x35E8092E3F889800
+ .quad 0x3471F5463F8B95C0
+ .quad 0x36E62D173F8EA400
+ .quad 0x361B9D593F91C3C0
+ .quad 0x36BEA3FC3F94F4C0
+ .quad 0x36C146373F9837C0
+ .quad 0x36E6E7553F9B8D00
+ .quad 0x36C982473F9EF500
+ .quad 0x34C0C3123FA27040
+ .quad 0x36354D8B3FA5FEC0
+ .quad 0x3655A7543FA9A140
+ .quad 0x36FBA90B3FAD5800
+ .quad 0x36D6074B3FB123C0
+ .quad 0x36CCCFE73FB504C0
+ .quad 0x36BD1D8C3FB8FB80
+ .quad 0x368E7D603FBD0880
+ .quad 0x35CCA6673FC12C40
+ .quad 0x36A845543FC56700
+ .quad 0x36F619B93FC9B980
+ .quad 0x35C151F83FCE2480
+ .quad 0x366C8F893FD2A800
+ .quad 0x36F32B5A3FD744C0
+ .quad 0x36DE5F6C3FDBFB80
+    .quad 0x367761553FE0CCC0
+ .quad 0x355CEF903FE5B900
+    .quad 0x355CFBA53FEAC0C0
+ .quad 0x36E66F733FEFE480
+ .quad 0x36F454923FF52540
+ .quad 0x36CB6DC93FFA8380
+
+.align 16
+.L__Ln2By32_LeadTrailLead:
+ .octa 0x333FBE8E3CB17200333FBE8E3CB17200
+
+.L__max_expm1_arg:
+ .long 0x42B19999
+.L__log_OnePlus_OneByFour:
+ .long 0x3E647FBF
+
+.L__log_OneMinus_OneByFour:
+ .long 0xBE934B11
+
+.L__min_expm1_arg:
+ .long 0xC18AA122
+
+.L__thirtyTwo_by_ln2:
+ .long 0x4238AA3B
+
+.align 16
+.L__B2_f:
+ .long 0x3E2AAAEC
+.L__B1_f:
+ .long 0x3F000044
+.L__One_f:
+ .long 0x3F800000
+.L__PointFive:
+ .quad 0x3FE0000000000000
+
+.align 16
+.L__A1_f:
+ .long 0x3E2AAAAA
+.L__A2_f:
+ .long 0x3D2AAAA0
+.L__A3_f:
+ .long 0x3C0889FF
+.L__A4_f:
+ .long 0x3AB64DE5
+.L__A5_f:
+ .long 0x394AB327
+
+
+
+
+
diff --git a/src/gas/fabs.S b/src/gas/fabs.S
new file mode 100644
index 0000000..a436d0f
--- /dev/null
+++ b/src/gas/fabs.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fabs.S
+#
+# An implementation of the fabs libm function.
+#
+# Prototype:
+#
+# double fabs(double x);
+#
+
+#
+# Algorithm: AND the Most Significant Bit of the
+# double precision number with 0 to get the
+# floating point absolute.
+#
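+#
+# For reference only, a minimal C sketch of the same bit trick
+# (the helper name is illustrative and not part of this library;
+# assumes IEEE-754 doubles):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   static double fabs_sketch(double x) {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);   /* reinterpret the double */
+#       bits &= 0x7FFFFFFFFFFFFFFFULL;    /* clear the sign bit     */
+#       memcpy(&x, &bits, sizeof x);
+#       return x;
+#   }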
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fabs)
+#define fname_special _fabs_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #input is in xmm0, which contains the final result also.
+ andpd .L__fabs_and_mask(%rip), %xmm0 # <result> latency = 3
+ ret
+
+
+.align 16
+.L__fabs_and_mask: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0x0
+
+
diff --git a/src/gas/fabsf.S b/src/gas/fabsf.S
new file mode 100644
index 0000000..8a6ea27
--- /dev/null
+++ b/src/gas/fabsf.S
@@ -0,0 +1,67 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fabsf.S
+#
+# An implementation of the fabsf libm function.
+#
+# Prototype:
+#
+# float fabsf(float x);
+#
+
+#
+# Algorithm: AND the Most Significant Bit of the
+# single precision number with 0 to get the
+# floating point absolute.
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fabsf)
+#define fname_special _fabsf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #input is in xmm0, which contains the final result also.
+ andps .L__fabsf_and_mask(%rip), %xmm0 # <result> latency = 3
+ ret
+
+
+.align 16
+.L__fabsf_and_mask: .long 0x7FFFFFFF
+ .long 0x0
+ .quad 0x0
+
+
+
+
+
diff --git a/src/gas/fdim.S b/src/gas/fdim.S
new file mode 100644
index 0000000..14e382f
--- /dev/null
+++ b/src/gas/fdim.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fdim.S
+#
+# An implementation of the fdim libm function.
+#
+# The fdim functions determine the positive difference between their arguments
+#
+# x - y if x > y
+# +0 if x <= y
+#
+#
+#
+# Prototype:
+#
+# double fdim(double x, double y)
+#
+
+#
+# Algorithm:
+#
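+# The SSE sequence below forms x - y and then masks it with the result
+# of the x > y comparison.  A rough C equivalent (a sketch only; it does
+# not reproduce the unordered/NaN behaviour of CMPNLESD, and the helper
+# name is illustrative):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   static double fdim_sketch(double x, double y) {
+#       double d = x - y;                        /* SUBSD            */
+#       uint64_t mask = (x > y) ? ~0ULL : 0ULL;  /* CMPNLESD-style   */
+#       uint64_t bits;
+#       memcpy(&bits, &d, sizeof bits);
+#       bits &= mask;                            /* ANDPD: d or +0.0 */
+#       memcpy(&d, &bits, sizeof d);
+#       return d;
+#   }
+#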
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fdim)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm2
+ SUBSD %xmm1,%xmm0
+ CMPNLESD %xmm1,%xmm2
+ ANDPD %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fdimf.S b/src/gas/fdimf.S
new file mode 100644
index 0000000..0b7a966
--- /dev/null
+++ b/src/gas/fdimf.S
@@ -0,0 +1,61 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fdimf.S
+#
+# An implementation of the fdimf libm function.
+#
+# The fdim functions determine the positive difference between their arguments
+#
+# x - y if x > y
+# +0 if x <= y
+#
+# Prototype:
+#
+# float fdimf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fdimf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm2
+ SUBSS %xmm1,%xmm0
+ CMPNLESS %xmm1,%xmm2
+ ANDPS %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fmax.S b/src/gas/fmax.S
new file mode 100644
index 0000000..ec0d787
--- /dev/null
+++ b/src/gas/fmax.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmax.S
+#
+# An implementation of the fmax libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+# double fmax(double x, double y)
+#
+
+#
+# Algorithm:
+#
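+# MAXSD returns its source operand (y here) whenever either input is a
+# NaN, so the mask-and-blend below falls back to x when the maximum
+# came out as a NaN.  A hedged C sketch of that selection (the helper
+# name is illustrative):
+#
+#   #include <math.h>
+#   static double fmax_sketch(double x, double y) {
+#       double m = (x > y) ? x : y;   /* MAXSD: y wins if either is NaN */
+#       if (isnan(m)) m = x;          /* CMPEQSD/PAND/PANDN/POR blend   */
+#       return m;
+#   }
+#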
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmax)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MAXSD %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSD %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fmaxf.S b/src/gas/fmaxf.S
new file mode 100644
index 0000000..828832f
--- /dev/null
+++ b/src/gas/fmaxf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmaxf.S
+#
+# An implementation of the fmaxf libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+# float fmaxf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmaxf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MAXSS %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSS %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fmin.S b/src/gas/fmin.S
new file mode 100644
index 0000000..79b3fb6
--- /dev/null
+++ b/src/gas/fmin.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmin.S
+#
+# An implementation of the fmin libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments
+#
+# Prototype:
+#
+# double fmin(double x, double y)
+#
+
+#
+# Algorithm:
+#
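+# As in fmax.S, MINSD returns its source operand (y here) whenever
+# either input is a NaN, and the blend below then falls back to x.
+# A hedged C sketch (the helper name is illustrative):
+#
+#   #include <math.h>
+#   static double fmin_sketch(double x, double y) {
+#       double m = (x < y) ? x : y;   /* MINSD: y wins if either is NaN */
+#       if (isnan(m)) m = x;          /* CMPEQSD/PAND/PANDN/POR blend   */
+#       return m;
+#   }
+#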
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmin)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MINSD %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSD %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fminf.S b/src/gas/fminf.S
new file mode 100644
index 0000000..34ee357
--- /dev/null
+++ b/src/gas/fminf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+        #q = x*x*x*(B1.f64 + x*(B2.f64 + x*(B3.f64 + x*(B4.f64 + x*(B5.f64 + x*(B6.f64 + x*(B7.f64 + x*(B8.f64 + x*(B9.f64)))))))));
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fminf.S
+#
+# An implementation of the fminf libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments
+#
+#
+# Prototype:
+#
+# float fminf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fminf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MINSS %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSS %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fmod.S b/src/gas/fmod.S
new file mode 100644
index 0000000..bc1eeae
--- /dev/null
+++ b/src/gas/fmod.S
@@ -0,0 +1,223 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmod.S
+#
+# An implementation of the fmod libm function.
+#
+# Prototype:
+#
+# double fmod(double x,double y);
+#
+
+#
+# Algorithm:
+#
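+# Two paths are used below.  When both exponents are nonzero and differ
+# by less than 52, the remainder is formed directly as |x| - t*|y| with
+# t = trunc(|x|/|y|) and the product t*|y| carried in extra precision;
+# otherwise the x87 fprem loop is used.  A rough C sketch of the fast
+# path (it omits the extra-precision product done in the assembly; the
+# helper name is illustrative):
+#
+#   #include <math.h>
+#   static double fmod_fast_sketch(double x, double y) {
+#       double ax = fabs(x), ay = fabs(y);
+#       double t = trunc(ax / ay);    /* cvttsd2siq / cvtsi2sdq below */
+#       double r = ax - t * ay;       /* remainder of |x| mod |y|     */
+#       if (r < 0.0) r += ay;         /* guard against rounding up    */
+#       return (x < 0.0) ? -r : r;    /* result carries the sign of x */
+#   }
+#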
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmod)
+#define fname_special _fmod_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x28
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %r10
+ #move the input to GP registers
+ movd %xmm0,%r8
+ movd %xmm1,%r9
+ movapd %xmm0,%xmm4
+ movapd %xmm1,%xmm5
+ movapd .L__Nan_64(%rip),%xmm6
+ and %r10,%r8
+ and %r10,%r9
+ ror $52, %r8
+ ror $52, %r9
+        #if either of the exponents is zero we do the fmod calculation in x87 mode
+ test %r8, %r8
+ jz .L__LargeExpDiffComputation
+ mov %r9,%r10
+ test %r9, %r9
+ jz .L__LargeExpDiffComputation
+ sub %r9,%r8
+ cmp $52,%r8
+ jge .L__LargeExpDiffComputation
+ pand %xmm6,%xmm4
+ pand %xmm6,%xmm5
+ comisd %xmm5,%xmm4
+ jp .L__InputIsNaN # if either of xmm1 or xmm0 is a NaN then
+ # parity flag is set
+ jz .L__Input_Is_Equal
+ jbe .L__ReturnImmediate
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+
+ #calculation without using the x87 FPU
+.L__DirectComputation:
+ movapd %xmm4,%xmm2
+ movapd %xmm5,%xmm3
+ divsd %xmm3,%xmm2
+ cvttsd2siq %xmm2,%r8
+ cvtsi2sdq %r8,%xmm2
+
+ #multiplication in QUAD Precision
+ #Since the below commented multiplication resulted in an error
+ #we had to implement a quad precision multiplication.
+ #LOGIC behind Quad Precision Multiplication
+ #x = hx + tx by setting x's last 27 bits to null
+ #y = hy + ty similar to x
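+        #In C terms this block is a Dekker-style exact product (a sketch,
+        #assuming round-to-nearest doubles; bits()/from_bits() stand for
+        #reinterpreting a double as a 64-bit integer and back):
+        #  hx = from_bits(bits(x) & 0xFFFFFFFFF8000000); tx = x - hx;
+        #  hy = from_bits(bits(y) & 0xFFFFFFFFF8000000); ty = y - hy;
+        #  z  = x * y;
+        #  zz = (((hx*hy - z) + hx*ty) + tx*hy) + tx*ty;  /* low part, z + zz ~ x*y exactly */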
+ movapd .L__27bit_andingmask_64(%rip),%xmm4
+ #movddup %xmm5,%xmm5 #[x,x]
+ #movddup %xmm2,%xmm2 #[y,y]
+
+ movapd %xmm5,%xmm1 # x
+ movapd %xmm2,%xmm6 # y
+ movapd %xmm2,%xmm7 #
+ mulsd %xmm5,%xmm7 # xmm7 = z = x*y
+ andpd %xmm4,%xmm1
+ andpd %xmm4,%xmm2
+ subsd %xmm1,%xmm5 # xmm1 = hx xmm5 = tx
+ subsd %xmm2,%xmm6 # xmm2 = hy xmm6 = ty
+
+ movapd %xmm1,%xmm4 # copy hx
+ mulsd %xmm2,%xmm4 # xmm4 = hx*hy
+ subsd %xmm7,%xmm4 # xmm4 = (hx*hy - z)
+ mulsd %xmm6,%xmm1 # xmm1 = hx * ty
+ addsd %xmm1,%xmm4 # xmm4 = ((hx * hy - *z) + hx * ty)
+ mulsd %xmm5,%xmm2 # xmm2 = tx * hy
+ addsd %xmm2,%xmm4 # xmm4 = (((hx * hy - *z) + hx * ty) + tx * hy)
+ mulsd %xmm5,%xmm6 # xmm6 = tx * ty
+ addsd %xmm4,%xmm6 # xmm6 = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
+ #xmm6 and xmm7 contain the quad precision result
+ #v = dx - c;
+ #dx = v + (((dx - v) - c) - cc);
+ movapd %xmm0,%xmm1 # copy the input number
+ pand .L__Nan_64(%rip),%xmm1
+ movapd %xmm1,%xmm2 # xmm2 = dx = xmm1
+ subsd %xmm7,%xmm1 # v = dx - c
+ subsd %xmm1,%xmm2 # (dx - v)
+ subsd %xmm7,%xmm2 # ((dx - v) - c)
+ subsd %xmm6,%xmm2 # (((dx - v) - c) - cc)
+ addsd %xmm1,%xmm2 # xmm2 = dx = v + (((dx - v) - c) - cc)
+ # xmm3 = w
+ comisd .L__Zero_64(%rip),%xmm2
+ jae .L__positive
+ addsd %xmm3,%xmm2
+.L__positive:
+# return x < 0.0? -dx : dx;
+.L__Finish:
+ comisd .L__Zero_64(%rip), %xmm0
+ ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ ret
+.L__Not_Negative_Number1:
+ movapd %xmm2,%xmm0
+ ret
+
+ #calculation using the x87 FPU
+        #For numbers where the exponent of either the divisor
+        #or the dividend is 0, or where the exponent
+        #difference is greater than 52.
+.align 16
+.L__LargeExpDiffComputation:
+ sub $stack_size, %rsp
+ movsd %xmm0, temp_x(%rsp)
+ movsd %xmm1, temp_y(%rsp)
+ ffree %st(0)
+ ffree %st(1)
+ fldl temp_y(%rsp)
+ fldl temp_x(%rsp)
+ fnclex
+.align 32
+.L__repeat:
+ fprem #Calculate remainder by dividing st(0) with st(1)
+ #fprem operation sets x87 condition codes,
+ #it will set the C2 code to 1 if a partial remainder is calculated
+ fnstsw %ax
+        and     $0x0400,%ax     # fnstsw above stored the x87 status word in %ax;
+                                # keep only bit 10 (C2) of the condition codes
+ cmp $0x0400,%ax # Checks whether the bit 10(C2) is set or not
+ # IF its set then a partial remainder was calculated
+ jz .L__repeat
+ #store the result from the FPU stack to memory
+ fstpl temp_x(%rsp)
+ fstpl temp_y(%rsp)
+ movsd temp_x(%rsp), %xmm0
+ add $stack_size, %rsp
+ ret
+
+ #IF both the inputs are equal
+.L__Input_Is_Equal:
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r9
+ jz .L__InputIsNaN
+ movsd %xmm0,%xmm1
+ pand .L__sign_mask_64(%rip),%xmm1
+ movsd .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaN:
+ por .L__QNaN_mask_64(%rip),%xmm0
+ por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ ret
+
+
+
+.align 32
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__QNaN_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__Nan_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+
diff --git a/src/gas/fmodf.S b/src/gas/fmodf.S
new file mode 100644
index 0000000..c31d619
--- /dev/null
+++ b/src/gas/fmodf.S
@@ -0,0 +1,181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmodf.S
+#
+# An implementation of the fmodf libm function.
+#
+# Prototype:
+#
+# float fmodf(float x,float y);
+#
+
+#
+# Algorithm:
+#
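+# The float inputs are widened to double and |x| is reduced against a
+# scaled copy of |y|: w starts at 2^(24*ntimes) * |y| and is divided by
+# 2^24 on every pass, so each pass removes up to 24 bits of quotient.
+# A hedged C sketch of a single reduction pass (the helper name is
+# illustrative):
+#
+#   #include <math.h>
+#   static double reduce_pass_sketch(double dx, double *w) {
+#       double t = trunc(dx / *w);  /* cvttsd2siq / cvtsi2sdq below    */
+#       dx -= t * *w;               /* strip that part of the quotient */
+#       *w *= 0x1p-24;              /* scale the divisor down by 2^24  */
+#       return dx;
+#   }
+#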
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmodf)
+#define fname_special _fmodf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %rdi
+ movapd .L__sign_mask_64(%rip),%xmm6
+ cvtss2sd %xmm0,%xmm2 # double x
+ cvtss2sd %xmm1,%xmm3 # double y
+ pand %xmm6,%xmm2
+ pand %xmm6,%xmm3
+ movd %xmm2,%rax
+ movd %xmm3,%r8
+ mov %rax,%r11
+ mov %r8,%r9
+ movsd %xmm2,%xmm4
+ #take the exponents of both x and y
+ and %rdi,%rax
+ and %rdi,%r8
+ ror $52, %rax
+ ror $52, %r8
+        # if either of the exponents is all ones (NaN or infinity)
+ cmp $0X7FF,%rax
+ jz .L__InputIsNaN
+ cmp $0X7FF,%r8
+ jz .L__InputIsNaNOrInf
+
+ cmp $0,%r8
+ jz .L__Divisor_Is_Zero
+
+ cmp %r9, %r11
+ jz .L__Input_Is_Equal
+ jb .L__ReturnImmediate
+
+ xor %rcx,%rcx
+ mov $24,%rdx
+ movsd .L__One_64(%rip),%xmm7 # xmm7 = scale
+ cmp %rax,%r8
+ jae .L__y_is_greater
+ #xmm3 = dy
+ sub %r8,%rax
+ div %dl # al = ntimes
+ mov %al,%cl # cl = ntimes
+        and     $0xFF,%ax       # set everything to zero except al
+ mul %dl # ax = dl * al = 24* ntimes
+ add $1023, %rax
+ shl $52,%rax
+ movd %rax,%xmm7 # xmm7 = scale
+.L__y_is_greater:
+ mulsd %xmm3,%xmm7 # xmm7 = scale * dy
+ movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+ dec %cl
+ js .L__End_Loop
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ mulsd %xmm6,%xmm7 # w*= scale
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm2,%xmm4 # xmm4 = dx
+ jmp .L__Start_Loop
+.L__End_Loop:
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ comiss .L__Zero_64(%rip),%xmm0
+ jb .L__Negative
+.L__Positive:
+ cvtsd2ss %xmm2,%xmm0
+ ret
+.L__Negative:
+ movsd .L__MinusZero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.align 16
+.L__Input_Is_Equal:
+ cmp $0x7FF,%rax
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r8
+ jz .L__InputIsNaNOrInf
+ movsd %xmm0,%xmm1
+ pand .L__sign_bit_32(%rip),%xmm1
+ movss .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaNOrInf:
+ comiss %xmm0,%xmm1
+ jp .L__InputIsNaN
+ ret
+.L__Divisor_Is_Zero:
+.L__InputIsNaN:
+ por .L__exp_mask_32(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ por .L__QNaN_mask_32(%rip),%xmm0
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ #xmm0 contains the input and is the result
+ ret
+
+
+
+.align 32
+.L__sign_bit_32: .quad 0x8000000080000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__exp_mask_32: .quad 0x000000007F800000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__One_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__MinusZero_64: .quad 0x8000000000000000
+ .quad 0
+.L__QNaN_mask_32: .quad 0x0000000000400000
+ .quad 0
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2pminus24_decimal: .quad 0x3E70000000000000
+ .quad 0
+
diff --git a/src/gas/log.S b/src/gas/log.S
new file mode 100644
index 0000000..7068c6d
--- /dev/null
+++ b/src/gas/log.S
@@ -0,0 +1,1155 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log.S
+#
+# An implementation of the log libm function.
+#
+# Prototype:
+#
+# double log(double x);
+#
+
+#
+# Algorithm:
+#
+# Based on:
+# Ping-Tak Peter Tang
+# "Table-driven implementation of the logarithm function in IEEE
+# floating-point arithmetic"
+# ACM Transactions on Mathematical Software (TOMS)
+# Volume 16, Issue 4 (December 1990)
+#
+#
+# x very close to 1.0 is handled differently, for x everywhere else
+# a brief explanation is given below
+#
+# x = (2^m)*A
+# x = (2^m)*(G+g) with (1 <= G < 2) and (g <= 2^(-9))
+# x = (2^m)*2*(G/2+g/2)
+# x = (2^m)*2*(F+f) with (0.5 <= F < 1) and (f <= 2^(-10))
+#
+# Y = (2^(-1))*(2^(-m))*(2^m)*A
+# Now, range of Y is: 0.5 <= Y < 1
+#
+# F = 0x100 + (first 8 mantissa bits) + (9th mantissa bit)
+# Now, range of F is: 256 <= F <= 512
+# F = F / 512
+# Now, range of F is: 0.5 <= F <= 1
+#
+# f = -(Y-F), with (f <= 2^(-10))
+#
+# log(x) = m*log(2) + log(2) + log(F-f)
+# log(x) = m*log(2) + log(2) + log(F) + log(1-(f/F))
+# log(x) = m*log(2) + log(2*F) + log(1-r)
+#
+# r = (f/F), with (r <= 2^(-9))
+# r = f*(1/F) with (1/F) precomputed to avoid division
+#
+# log(x) = m*log(2) + log(G) - poly
+#
+# log(G) is precomputed
+# poly = (r + (r^2)/2 + (r^3)/3 + (r^4)/4) + (r^5)/5) + (r^6)/6))
+#
+# log(2) and log(G) need to be maintained in extra precision
+# to avoid losing precision in the calculations
+#
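+# A condensed C sketch of the main path (a sketch only: the real code
+# reads log(2*F) and 1/F from the tables below and keeps lead/tail
+# parts for extra precision; the helper name is illustrative):
+#
+#   #include <math.h>
+#   static double log_sketch(double x) {
+#       int m;
+#       double y = frexp(x, &m);               /* x = y * 2^m, y in [0.5,1)  */
+#       double F = (double)(int)(y * 512.0 + 0.5) / 512.0;  /* 9-bit rounding */
+#       double r = (F - y) / F;                /* r = f/F, |r| <= ~2^-9       */
+#       double poly = r + r*r/2 + r*r*r/3;     /* first terms of the series   */
+#       return m * 0.6931471805599453 + log(F) - poly;
+#   }
+#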
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log)
+#define fname_special _log_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+
+ # m*log(2) + log(G) - poly
+ movsd .L__real_log2_tail(%rip), %xmm5
+ mulsd %xmm6, %xmm5
+ subsd %xmm1, %xmm5
+
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ addsd %xmm5, %xmm2
+
+ movsd .L__real_log2_lead(%rip), %xmm4
+ mulsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x0000000000000000
+.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f6ff00aa0000000
+ .quad 0x3f7fe02a60000000
+ .quad 0x3f87dc4750000000
+ .quad 0x3f8fc0a8b0000000
+ .quad 0x3f93cea440000000
+ .quad 0x3f97b91b00000000
+ .quad 0x3f9b9fc020000000
+ .quad 0x3f9f829b00000000
+ .quad 0x3fa1b0d980000000
+ .quad 0x3fa39e87b0000000
+ .quad 0x3fa58a5ba0000000
+ .quad 0x3fa77458f0000000
+ .quad 0x3fa95c8300000000
+ .quad 0x3fab42dd70000000
+ .quad 0x3fad276b80000000
+ .quad 0x3faf0a30c0000000
+ .quad 0x3fb0759830000000
+ .quad 0x3fb16536e0000000
+ .quad 0x3fb253f620000000
+ .quad 0x3fb341d790000000
+ .quad 0x3fb42edcb0000000
+ .quad 0x3fb51b0730000000
+ .quad 0x3fb60658a0000000
+ .quad 0x3fb6f0d280000000
+ .quad 0x3fb7da7660000000
+ .quad 0x3fb8c345d0000000
+ .quad 0x3fb9ab4240000000
+ .quad 0x3fba926d30000000
+ .quad 0x3fbb78c820000000
+ .quad 0x3fbc5e5480000000
+ .quad 0x3fbd4313d0000000
+ .quad 0x3fbe270760000000
+ .quad 0x3fbf0a30c0000000
+ .quad 0x3fbfec9130000000
+ .quad 0x3fc0671510000000
+ .quad 0x3fc0d77e70000000
+ .quad 0x3fc1478580000000
+ .quad 0x3fc1b72ad0000000
+ .quad 0x3fc2266f10000000
+ .quad 0x3fc29552f0000000
+ .quad 0x3fc303d710000000
+ .quad 0x3fc371fc20000000
+ .quad 0x3fc3dfc2b0000000
+ .quad 0x3fc44d2b60000000
+ .quad 0x3fc4ba36f0000000
+ .quad 0x3fc526e5e0000000
+ .quad 0x3fc59338d0000000
+ .quad 0x3fc5ff3070000000
+ .quad 0x3fc66acd40000000
+ .quad 0x3fc6d60fe0000000
+ .quad 0x3fc740f8f0000000
+ .quad 0x3fc7ab8900000000
+ .quad 0x3fc815c0a0000000
+ .quad 0x3fc87fa060000000
+ .quad 0x3fc8e928d0000000
+ .quad 0x3fc9525a90000000
+ .quad 0x3fc9bb3620000000
+ .quad 0x3fca23bc10000000
+ .quad 0x3fca8becf0000000
+ .quad 0x3fcaf3c940000000
+ .quad 0x3fcb5b5190000000
+ .quad 0x3fcbc28670000000
+ .quad 0x3fcc296850000000
+ .quad 0x3fcc8ff7c0000000
+ .quad 0x3fccf63540000000
+ .quad 0x3fcd5c2160000000
+ .quad 0x3fcdc1bca0000000
+ .quad 0x3fce270760000000
+ .quad 0x3fce8c0250000000
+ .quad 0x3fcef0adc0000000
+ .quad 0x3fcf550a50000000
+ .quad 0x3fcfb91860000000
+ .quad 0x3fd00e6c40000000
+ .quad 0x3fd0402590000000
+ .quad 0x3fd071b850000000
+ .quad 0x3fd0a324e0000000
+ .quad 0x3fd0d46b50000000
+ .quad 0x3fd1058bf0000000
+ .quad 0x3fd1368700000000
+ .quad 0x3fd1675ca0000000
+ .quad 0x3fd1980d20000000
+ .quad 0x3fd1c898c0000000
+ .quad 0x3fd1f8ff90000000
+ .quad 0x3fd22941f0000000
+ .quad 0x3fd2596010000000
+ .quad 0x3fd2895a10000000
+ .quad 0x3fd2b93030000000
+ .quad 0x3fd2e8e2b0000000
+ .quad 0x3fd31871c0000000
+ .quad 0x3fd347dd90000000
+ .quad 0x3fd3772660000000
+ .quad 0x3fd3a64c50000000
+ .quad 0x3fd3d54fa0000000
+ .quad 0x3fd4043080000000
+ .quad 0x3fd432ef20000000
+ .quad 0x3fd4618bc0000000
+ .quad 0x3fd4900680000000
+ .quad 0x3fd4be5f90000000
+ .quad 0x3fd4ec9730000000
+ .quad 0x3fd51aad80000000
+ .quad 0x3fd548a2c0000000
+ .quad 0x3fd5767710000000
+ .quad 0x3fd5a42ab0000000
+ .quad 0x3fd5d1bdb0000000
+ .quad 0x3fd5ff3070000000
+ .quad 0x3fd62c82f0000000
+ .quad 0x3fd659b570000000
+ .quad 0x3fd686c810000000
+ .quad 0x3fd6b3bb20000000
+ .quad 0x3fd6e08ea0000000
+ .quad 0x3fd70d42e0000000
+ .quad 0x3fd739d7f0000000
+ .quad 0x3fd7664e10000000
+ .quad 0x3fd792a550000000
+ .quad 0x3fd7bede00000000
+ .quad 0x3fd7eaf830000000
+ .quad 0x3fd816f410000000
+ .quad 0x3fd842d1d0000000
+ .quad 0x3fd86e9190000000
+ .quad 0x3fd89a3380000000
+ .quad 0x3fd8c5b7c0000000
+ .quad 0x3fd8f11e80000000
+ .quad 0x3fd91c67e0000000
+ .quad 0x3fd9479410000000
+ .quad 0x3fd972a340000000
+ .quad 0x3fd99d9580000000
+ .quad 0x3fd9c86b00000000
+ .quad 0x3fd9f323e0000000
+ .quad 0x3fda1dc060000000
+ .quad 0x3fda484090000000
+ .quad 0x3fda72a490000000
+ .quad 0x3fda9cec90000000
+ .quad 0x3fdac718c0000000
+ .quad 0x3fdaf12930000000
+ .quad 0x3fdb1b1e00000000
+ .quad 0x3fdb44f770000000
+ .quad 0x3fdb6eb590000000
+ .quad 0x3fdb985890000000
+ .quad 0x3fdbc1e080000000
+ .quad 0x3fdbeb4d90000000
+ .quad 0x3fdc149ff0000000
+ .quad 0x3fdc3dd7a0000000
+ .quad 0x3fdc66f4e0000000
+ .quad 0x3fdc8ff7c0000000
+ .quad 0x3fdcb8e070000000
+ .quad 0x3fdce1af00000000
+ .quad 0x3fdd0a63a0000000
+ .quad 0x3fdd32fe70000000
+ .quad 0x3fdd5b7f90000000
+ .quad 0x3fdd83e720000000
+ .quad 0x3fddac3530000000
+ .quad 0x3fddd46a00000000
+ .quad 0x3fddfc8590000000
+ .quad 0x3fde248810000000
+ .quad 0x3fde4c71a0000000
+ .quad 0x3fde744260000000
+ .quad 0x3fde9bfa60000000
+ .quad 0x3fdec399d0000000
+ .quad 0x3fdeeb20c0000000
+ .quad 0x3fdf128f50000000
+ .quad 0x3fdf39e5b0000000
+ .quad 0x3fdf6123f0000000
+ .quad 0x3fdf884a30000000
+ .quad 0x3fdfaf5880000000
+ .quad 0x3fdfd64f20000000
+ .quad 0x3fdffd2e00000000
+ .quad 0x3fe011fab0000000
+ .quad 0x3fe02552a0000000
+ .quad 0x3fe0389ee0000000
+ .quad 0x3fe04bdf90000000
+ .quad 0x3fe05f14b0000000
+ .quad 0x3fe0723e50000000
+ .quad 0x3fe0855c80000000
+ .quad 0x3fe0986f40000000
+ .quad 0x3fe0ab76b0000000
+ .quad 0x3fe0be72e0000000
+ .quad 0x3fe0d163c0000000
+ .quad 0x3fe0e44980000000
+ .quad 0x3fe0f72410000000
+ .quad 0x3fe109f390000000
+ .quad 0x3fe11cb810000000
+ .quad 0x3fe12f7190000000
+ .quad 0x3fe1422020000000
+ .quad 0x3fe154c3d0000000
+ .quad 0x3fe1675ca0000000
+ .quad 0x3fe179eab0000000
+ .quad 0x3fe18c6e00000000
+ .quad 0x3fe19ee6b0000000
+ .quad 0x3fe1b154b0000000
+ .quad 0x3fe1c3b810000000
+ .quad 0x3fe1d610f0000000
+ .quad 0x3fe1e85f50000000
+ .quad 0x3fe1faa340000000
+ .quad 0x3fe20cdcd0000000
+ .quad 0x3fe21f0bf0000000
+ .quad 0x3fe23130d0000000
+ .quad 0x3fe2434b60000000
+ .quad 0x3fe2555bc0000000
+ .quad 0x3fe2676200000000
+ .quad 0x3fe2795e10000000
+ .quad 0x3fe28b5000000000
+ .quad 0x3fe29d37f0000000
+ .quad 0x3fe2af15f0000000
+ .quad 0x3fe2c0e9e0000000
+ .quad 0x3fe2d2b400000000
+ .quad 0x3fe2e47430000000
+ .quad 0x3fe2f62a90000000
+ .quad 0x3fe307d730000000
+ .quad 0x3fe3197a00000000
+ .quad 0x3fe32b1330000000
+ .quad 0x3fe33ca2b0000000
+ .quad 0x3fe34e2890000000
+ .quad 0x3fe35fa4e0000000
+ .quad 0x3fe37117b0000000
+ .quad 0x3fe38280f0000000
+ .quad 0x3fe393e0d0000000
+ .quad 0x3fe3a53730000000
+ .quad 0x3fe3b68440000000
+ .quad 0x3fe3c7c7f0000000
+ .quad 0x3fe3d90260000000
+ .quad 0x3fe3ea3390000000
+ .quad 0x3fe3fb5b80000000
+ .quad 0x3fe40c7a40000000
+ .quad 0x3fe41d8fe0000000
+ .quad 0x3fe42e9c60000000
+ .quad 0x3fe43f9fe0000000
+ .quad 0x3fe4509a50000000
+ .quad 0x3fe4618bc0000000
+ .quad 0x3fe4727430000000
+ .quad 0x3fe48353d0000000
+ .quad 0x3fe4942a80000000
+ .quad 0x3fe4a4f850000000
+ .quad 0x3fe4b5bd60000000
+ .quad 0x3fe4c679a0000000
+ .quad 0x3fe4d72d30000000
+ .quad 0x3fe4e7d810000000
+ .quad 0x3fe4f87a30000000
+ .quad 0x3fe50913c0000000
+ .quad 0x3fe519a4c0000000
+ .quad 0x3fe52a2d20000000
+ .quad 0x3fe53aad00000000
+ .quad 0x3fe54b2460000000
+ .quad 0x3fe55b9350000000
+ .quad 0x3fe56bf9d0000000
+ .quad 0x3fe57c57f0000000
+ .quad 0x3fe58cadb0000000
+ .quad 0x3fe59cfb20000000
+ .quad 0x3fe5ad4040000000
+ .quad 0x3fe5bd7d30000000
+ .quad 0x3fe5cdb1d0000000
+ .quad 0x3fe5ddde50000000
+ .quad 0x3fe5ee02a0000000
+ .quad 0x3fe5fe1ed0000000
+ .quad 0x3fe60e32f0000000
+ .quad 0x3fe61e3ef0000000
+ .quad 0x3fe62e42e0000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db5885e0250435a
+ .quad 0x3de620cf11f86ed2
+ .quad 0x3dff0214edba4a25
+ .quad 0x3dbf807c79f3db4e
+ .quad 0x3dea352ba779a52b
+ .quad 0x3dff56c46aa49fd5
+ .quad 0x3dfebe465fef5196
+ .quad 0x3e0cf0660099f1f8
+ .quad 0x3e1247b2ff85945d
+ .quad 0x3e13fd7abf5202b6
+ .quad 0x3e1f91c9a918d51e
+ .quad 0x3e08cb73f118d3ca
+ .quad 0x3e1d91c7d6fad074
+ .quad 0x3de1971bec28d14c
+ .quad 0x3e15b616a423c78a
+ .quad 0x3da162a6617cc971
+ .quad 0x3e166391c4c06d29
+ .quad 0x3e2d46f5c1d0c4b8
+ .quad 0x3e2e14282df1f6d3
+ .quad 0x3e186f47424a660d
+ .quad 0x3e2d4c8de077753e
+ .quad 0x3e2e0c307ed24f1c
+ .quad 0x3e226ea18763bdd3
+ .quad 0x3e25cad69737c933
+ .quad 0x3e2af62599088901
+ .quad 0x3e18c66c83d6b2d0
+ .quad 0x3e1880ceb36fb30f
+ .quad 0x3e2495aac6ca17a4
+ .quad 0x3e2761db4210878c
+ .quad 0x3e2eb78e862bac2f
+ .quad 0x3e19b2cd75790dd9
+ .quad 0x3e2c55e5cbd3d50f
+ .quad 0x3db162a6617cc971
+ .quad 0x3dfdbeabaaa2e519
+ .quad 0x3e1652cb7150c647
+ .quad 0x3e39a11cb2cd2ee2
+ .quad 0x3e219d0ab1a28813
+ .quad 0x3e24bd9e80a41811
+ .quad 0x3e3214b596faa3df
+ .quad 0x3e303fea46980bb8
+ .quad 0x3e31c8ffa5fd28c7
+ .quad 0x3dce8f743bcd96c5
+ .quad 0x3dfd98c5395315c6
+ .quad 0x3e3996fa3ccfa7b2
+ .quad 0x3e1cd2af2ad13037
+ .quad 0x3e1d0da1bd17200e
+ .quad 0x3e3330410ba68b75
+ .quad 0x3df4f27a790e7c41
+ .quad 0x3e13956a86f6ff1b
+ .quad 0x3e2c6748723551d9
+ .quad 0x3e2500de9326cdfc
+ .quad 0x3e1086c848df1b59
+ .quad 0x3e04357ead6836ff
+ .quad 0x3e24832442408024
+ .quad 0x3e3d10da8154b13d
+ .quad 0x3e39e8ad68ec8260
+ .quad 0x3e3cfbf706abaf18
+ .quad 0x3e3fc56ac6326e23
+ .quad 0x3e39105e3185cf21
+ .quad 0x3e3d017fe5b19cc0
+ .quad 0x3e3d1f6b48dd13fe
+ .quad 0x3e20b63358a7e73a
+ .quad 0x3e263063028c211c
+ .quad 0x3e2e6a6886b09760
+ .quad 0x3e3c138bb891cd03
+ .quad 0x3e369f7722b7221a
+ .quad 0x3df57d8fac1a628c
+ .quad 0x3e3c55e5cbd3d50f
+ .quad 0x3e1552d2ff48fe2e
+ .quad 0x3e37b8b26ca431bc
+ .quad 0x3e292decdc1c5f6d
+ .quad 0x3e3abc7c551aaa8c
+ .quad 0x3e36b540731a354b
+ .quad 0x3e32d341036b89ef
+ .quad 0x3e4f9ab21a3a2e0f
+ .quad 0x3e239c871afb9fbd
+ .quad 0x3e3e6add2c81f640
+ .quad 0x3e435c95aa313f41
+ .quad 0x3e249d4582f6cc53
+ .quad 0x3e47574c1c07398f
+ .quad 0x3e4ba846dece9e8d
+ .quad 0x3e16999fafbc68e7
+ .quad 0x3e4c9145e51b0103
+ .quad 0x3e479ef2cb44850a
+ .quad 0x3e0beec73de11275
+ .quad 0x3e2ef4351af5a498
+ .quad 0x3e45713a493b4a50
+ .quad 0x3e45c23a61385992
+ .quad 0x3e42a88309f57299
+ .quad 0x3e4530faa9ac8ace
+ .quad 0x3e25fec2d792a758
+ .quad 0x3e35a517a71cbcd7
+ .quad 0x3e3707dc3e1cd9a3
+ .quad 0x3e3a1a9f8ef43049
+ .quad 0x3e4409d0276b3674
+ .quad 0x3e20e2f613e85bd9
+ .quad 0x3df0027433001e5f
+ .quad 0x3e35dde2836d3265
+ .quad 0x3e2300134d7aaf04
+ .quad 0x3e3cb7e0b42724f5
+ .quad 0x3e2d6e93167e6308
+ .quad 0x3e3d1569b1526adb
+ .quad 0x3e0e99fc338a1a41
+ .quad 0x3e4eb01394a11b1c
+ .quad 0x3e04f27a790e7c41
+ .quad 0x3e25ce3ca97b7af9
+ .quad 0x3e281f0f940ed857
+ .quad 0x3e4d36295d88857c
+ .quad 0x3e21aca1ec4af526
+ .quad 0x3e445743c7182726
+ .quad 0x3e23c491aead337e
+ .quad 0x3e3aef401a738931
+ .quad 0x3e21cede76092a29
+ .quad 0x3e4fba8f44f82bb4
+ .quad 0x3e446f5f7f3c3e1a
+ .quad 0x3e47055f86c9674b
+ .quad 0x3e4b41a92b6b6e1a
+ .quad 0x3e443d162e927628
+ .quad 0x3e4466174013f9b1
+ .quad 0x3e3b05096ad69c62
+ .quad 0x3e40b169150faa58
+ .quad 0x3e3cd98b1df85da7
+ .quad 0x3e468b507b0f8fa8
+ .quad 0x3e48422df57499ba
+ .quad 0x3e11351586970274
+ .quad 0x3e117e08acba92ee
+ .quad 0x3e26e04314dd0229
+ .quad 0x3e497f3097e56d1a
+ .quad 0x3e3356e655901286
+ .quad 0x3e0cb761457f94d6
+ .quad 0x3e39af67a85a9dac
+ .quad 0x3e453410931a909f
+ .quad 0x3e22c587206058f5
+ .quad 0x3e223bc358899c22
+ .quad 0x3e4d7bf8b6d223cb
+ .quad 0x3e47991ec5197ddb
+ .quad 0x3e4a79e6bb3a9219
+ .quad 0x3e3a4c43ed663ec5
+ .quad 0x3e461b5a1484f438
+ .quad 0x3e4b4e36f7ef0c3a
+ .quad 0x3e115f026acd0d1b
+ .quad 0x3e3f36b535cecf05
+ .quad 0x3e2ffb7fbf3eb5c6
+ .quad 0x3e3e6a6886b09760
+ .quad 0x3e3135eb27f5bbc3
+ .quad 0x3e470be7d6f6fa57
+ .quad 0x3e4ce43cc84ab338
+ .quad 0x3e4c01d7aac3bd91
+ .quad 0x3e45c58d07961060
+ .quad 0x3e3628bcf941456e
+ .quad 0x3e4c58b2a8461cd2
+ .quad 0x3e33071282fb989a
+ .quad 0x3e420dab6a80f09c
+ .quad 0x3e44f8d84c397b1e
+ .quad 0x3e40d0ee08599e48
+ .quad 0x3e1d68787e37da36
+ .quad 0x3e366187d591bafc
+ .quad 0x3e22346600bae772
+ .quad 0x3e390377d0d61b8e
+ .quad 0x3e4f5e0dd966b907
+ .quad 0x3e49023cb79a00e2
+ .quad 0x3e44e05158c28ad8
+ .quad 0x3e3bfa7b08b18ae4
+ .quad 0x3e4ef1e63db35f67
+ .quad 0x3e0ec2ae39493d4f
+ .quad 0x3e40afe930ab2fa0
+ .quad 0x3e225ff8a1810dd4
+ .quad 0x3e469743fb1a71a5
+ .quad 0x3e5f9cc676785571
+ .quad 0x3e5b524da4cbf982
+ .quad 0x3e5a4c8b381535b8
+ .quad 0x3e5839be809caf2c
+ .quad 0x3e50968a1cb82c13
+ .quad 0x3e5eae6a41723fb5
+ .quad 0x3e5d9c29a380a4db
+ .quad 0x3e4094aa0ada625e
+ .quad 0x3e5973ad6fc108ca
+ .quad 0x3e4747322fdbab97
+ .quad 0x3e593692fa9d4221
+ .quad 0x3e5c5a992dfbc7d9
+ .quad 0x3e4e1f33e102387a
+ .quad 0x3e464fbef14c048c
+ .quad 0x3e4490f513ca5e3b
+ .quad 0x3e37a6af4d4c799d
+ .quad 0x3e57574c1c07398f
+ .quad 0x3e57b133417f8c1c
+ .quad 0x3e5feb9e0c176514
+ .quad 0x3e419f25bb3172f7
+ .quad 0x3e45f68a7bbfb852
+ .quad 0x3e5ee278497929f1
+ .quad 0x3e5ccee006109d58
+ .quad 0x3e5ce081a07bd8b3
+ .quad 0x3e570e12981817b8
+ .quad 0x3e292ab6d93503d0
+ .quad 0x3e58cb7dd7c3b61e
+ .quad 0x3e4efafd0a0b78da
+ .quad 0x3e5e907267c4288e
+ .quad 0x3e5d31ef96780875
+ .quad 0x3e23430dfcd2ad50
+ .quad 0x3e344d88d75bc1f9
+ .quad 0x3e5bec0f055e04fc
+ .quad 0x3e5d85611590b9ad
+ .quad 0x3df320568e583229
+ .quad 0x3e5a891d1772f538
+ .quad 0x3e22edc9dabba74d
+ .quad 0x3e4b9009a1015086
+ .quad 0x3e52a12a8c5b1a19
+ .quad 0x3e3a7885f0fdac85
+ .quad 0x3e5f4ffcd43ac691
+ .quad 0x3e52243ae2640aad
+ .quad 0x3e546513299035d3
+ .quad 0x3e5b39c3a62dd725
+ .quad 0x3e5ba6dd40049f51
+ .quad 0x3e451d1ed7177409
+ .quad 0x3e5cb0f2fd7f5216
+ .quad 0x3e3ab150cd4e2213
+ .quad 0x3e5cfd7bf3193844
+ .quad 0x3e53fff8455f1dbd
+ .quad 0x3e5fee640b905fc9
+ .quad 0x3e54e2adf548084c
+ .quad 0x3e3b597adc1ecdd2
+ .quad 0x3e4345bd096d3a75
+ .quad 0x3e5101b9d2453c8b
+ .quad 0x3e508ce55cc8c979
+ .quad 0x3e5bbf017e595f71
+ .quad 0x3e37ce733bd393dc
+ .quad 0x3e233bb0a503f8a1
+ .quad 0x3e30e2f613e85bd9
+ .quad 0x3e5e67555a635b3c
+ .quad 0x3e2ea88df73d5e8b
+ .quad 0x3e3d17e03bda18a8
+ .quad 0x3e5b607d76044f7e
+ .quad 0x3e52adc4e71bc2fc
+ .quad 0x3e5f99dc7362d1d9
+ .quad 0x3e5473fa008e6a6a
+ .quad 0x3e2b75bb09cb0985
+ .quad 0x3e5ea04dd10b9aba
+ .quad 0x3e5802d0d6979674
+ .quad 0x3e174688ccd99094
+ .quad 0x3e496f16abb9df22
+ .quad 0x3e46e66df2aa374f
+ .quad 0x3e4e66525ea4550a
+ .quad 0x3e42d02f34f20cbd
+ .quad 0x3e46cfce65047188
+ .quad 0x3e39b78c842d58b8
+ .quad 0x3e4735e624c24bc9
+ .quad 0x3e47eba1f7dd1adf
+ .quad 0x3e586b3e59f65355
+ .quad 0x3e1ce38e637f1b4d
+ .quad 0x3e58d82ec919edc7
+ .quad 0x3e4c52648ddcfa37
+ .quad 0x3e52482ceae1ac12
+ .quad 0x3e55a312311aba4f
+ .quad 0x3e411e236329f225
+ .quad 0x3e5b48c8cd2f246c
+ .quad 0x3e6efa39ef35793c
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+#endif
diff --git a/src/gas/log10.S b/src/gas/log10.S
new file mode 100644
index 0000000..90522ef
--- /dev/null
+++ b/src/gas/log10.S
@@ -0,0 +1,1146 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log10.S
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+# double log10(double x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
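+# Outline (inferred from the code below):
+#   x is split as 2^m * Y; Y is reduced against the nearest F on a 1/256
+#   grid (index from the top 8 mantissa bits, rounded via the 9th bit),
+#   giving r = (F - Y)/F with 1/F taken from .L__log_F_inv.  Then
+#     log10(x) ~ m*log10(2) + log10(F) - (r + r^2/2 + ... + r^6/6)*log10(e)
+#   where log10(F) comes from .L__log_256_lead/_tail and log10(2) and
+#   log10(e) are kept as lead/tail pairs.  Arguments with |x - 1| < 0.0625
+#   take the near-one codepath below instead.
+#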
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log10)
+#define fname_special _log10_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
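+ # rax now holds the top 8 mantissa bits rounded to nearest (carry from
+ # the 9th bit); the shr $44 below turns it into the table index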
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
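+ # xmm1 = r + r^2/2 + r^3/3 + r^4/4 + r^5/5 + r^6/6 ~ ln(F/Y)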
+
+ mulsd .L__real_log10_e(%rip), %xmm1
+
+ # m*log10(2) + log10(G) - poly
+ movsd .L__real_log10_2_tail(%rip), %xmm5
+ mulsd %xmm6, %xmm5
+ subsd %xmm1, %xmm5
+
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ addsd %xmm5, %xmm2
+
+ movsd .L__real_log10_2_lead(%rip), %xmm4
+ mulsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
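+ # |x - 1| < 0.0625: ln(x) ~ u + ca1*u^3 + ca2*u^5 + ca3*u^7 + ca4*u^9
+ # (u = 2(x-1)/(x+1), a 2*atanh(u/2)-style series); the result is then
+ # scaled by log10(e) kept as a lead/tail pair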
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+
+ movdqa %xmm0, %xmm3
+ pand .L__mask_lower(%rip), %xmm3
+ subsd %xmm3, %xmm0
+ addsd %xmm0, %xmm4
+
+ movsd %xmm3, %xmm0
+ movsd %xmm4, %xmm1
+
+ mulsd .L__real_log10_e_tail(%rip), %xmm4
+ mulsd .L__real_log10_e_tail(%rip), %xmm0
+ mulsd .L__real_log10_e_lead(%rip), %xmm1
+ mulsd .L__real_log10_e_lead(%rip), %xmm3
+
+ addsd %xmm4, %xmm0
+ addsd %xmm1, %xmm0
+ addsd %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
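+ # denormal input: OR in the implicit bit, subtract 1.0 to renormalize,
+ # then recover the mantissa bits and the true unbiased exponent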
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log10_e: .quad 0x3fdbcb7b1526e50e
+ .quad 0x0000000000000000
+
+.L__real_log10_e_lead: .quad 0x3fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x0000000000000000
+.L__real_log10_e_tail: .quad 0x3ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x0000000000000000
+
+.L__real_log10_2_lead: .quad 0x3fd3441350000000
+ .quad 0x0000000000000000
+.L__real_log10_2_tail: .quad 0x3e03ef3fde623e25
+ .quad 0x0000000000000000
+
+
+
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f5bbd9e90000000
+ .quad 0x3f6bafd470000000
+ .quad 0x3f74b99560000000
+ .quad 0x3f7b9476a0000000
+ .quad 0x3f81344da0000000
+ .quad 0x3f849b0850000000
+ .quad 0x3f87fe71c0000000
+ .quad 0x3f8b5e9080000000
+ .quad 0x3f8ebb6af0000000
+ .quad 0x3f910a83a0000000
+ .quad 0x3f92b5b5e0000000
+ .quad 0x3f945f4f50000000
+ .quad 0x3f96075300000000
+ .quad 0x3f97adc3d0000000
+ .quad 0x3f9952a4f0000000
+ .quad 0x3f9af5f920000000
+ .quad 0x3f9c97c370000000
+ .quad 0x3f9e3806a0000000
+ .quad 0x3f9fd6c5b0000000
+ .quad 0x3fa0ba01a0000000
+ .quad 0x3fa187e120000000
+ .quad 0x3fa25502c0000000
+ .quad 0x3fa32167c0000000
+ .quad 0x3fa3ed1190000000
+ .quad 0x3fa4b80180000000
+ .quad 0x3fa58238e0000000
+ .quad 0x3fa64bb910000000
+ .quad 0x3fa7148340000000
+ .quad 0x3fa7dc98c0000000
+ .quad 0x3fa8a3fad0000000
+ .quad 0x3fa96aaac0000000
+ .quad 0x3faa30a9d0000000
+ .quad 0x3faaf5f920000000
+ .quad 0x3fabba9a00000000
+ .quad 0x3fac7e8d90000000
+ .quad 0x3fad41d510000000
+ .quad 0x3fae0471a0000000
+ .quad 0x3faec66470000000
+ .quad 0x3faf87aeb0000000
+ .quad 0x3fb02428c0000000
+ .quad 0x3fb08426f0000000
+ .quad 0x3fb0e3d290000000
+ .quad 0x3fb1432c30000000
+ .quad 0x3fb1a23440000000
+ .quad 0x3fb200eb60000000
+ .quad 0x3fb25f5210000000
+ .quad 0x3fb2bd68e0000000
+ .quad 0x3fb31b3050000000
+ .quad 0x3fb378a8e0000000
+ .quad 0x3fb3d5d330000000
+ .quad 0x3fb432afa0000000
+ .quad 0x3fb48f3ed0000000
+ .quad 0x3fb4eb8120000000
+ .quad 0x3fb5477730000000
+ .quad 0x3fb5a32160000000
+ .quad 0x3fb5fe8040000000
+ .quad 0x3fb6599440000000
+ .quad 0x3fb6b45df0000000
+ .quad 0x3fb70eddb0000000
+ .quad 0x3fb7691400000000
+ .quad 0x3fb7c30160000000
+ .quad 0x3fb81ca630000000
+ .quad 0x3fb8760300000000
+ .quad 0x3fb8cf1830000000
+ .quad 0x3fb927e640000000
+ .quad 0x3fb9806d90000000
+ .quad 0x3fb9d8aea0000000
+ .quad 0x3fba30a9d0000000
+ .quad 0x3fba885fa0000000
+ .quad 0x3fbadfd070000000
+ .quad 0x3fbb36fcb0000000
+ .quad 0x3fbb8de4d0000000
+ .quad 0x3fbbe48930000000
+ .quad 0x3fbc3aea40000000
+ .quad 0x3fbc910870000000
+ .quad 0x3fbce6e410000000
+ .quad 0x3fbd3c7da0000000
+ .quad 0x3fbd91d580000000
+ .quad 0x3fbde6ec00000000
+ .quad 0x3fbe3bc1a0000000
+ .quad 0x3fbe9056b0000000
+ .quad 0x3fbee4aba0000000
+ .quad 0x3fbf38c0c0000000
+ .quad 0x3fbf8c9680000000
+ .quad 0x3fbfe02d30000000
+ .quad 0x3fc019c2a0000000
+ .quad 0x3fc0434f70000000
+ .quad 0x3fc06cbd60000000
+ .quad 0x3fc0960c80000000
+ .quad 0x3fc0bf3d00000000
+ .quad 0x3fc0e84f10000000
+ .quad 0x3fc11142f0000000
+ .quad 0x3fc13a18a0000000
+ .quad 0x3fc162d080000000
+ .quad 0x3fc18b6a90000000
+ .quad 0x3fc1b3e710000000
+ .quad 0x3fc1dc4630000000
+ .quad 0x3fc2048810000000
+ .quad 0x3fc22cace0000000
+ .quad 0x3fc254b4d0000000
+ .quad 0x3fc27c9ff0000000
+ .quad 0x3fc2a46e80000000
+ .quad 0x3fc2cc20b0000000
+ .quad 0x3fc2f3b690000000
+ .quad 0x3fc31b3050000000
+ .quad 0x3fc3428e20000000
+ .quad 0x3fc369d020000000
+ .quad 0x3fc390f680000000
+ .quad 0x3fc3b80160000000
+ .quad 0x3fc3def0e0000000
+ .quad 0x3fc405c530000000
+ .quad 0x3fc42c7e70000000
+ .quad 0x3fc4531cd0000000
+ .quad 0x3fc479a070000000
+ .quad 0x3fc4a00970000000
+ .quad 0x3fc4c65800000000
+ .quad 0x3fc4ec8c30000000
+ .quad 0x3fc512a640000000
+ .quad 0x3fc538a630000000
+ .quad 0x3fc55e8c50000000
+ .quad 0x3fc5845890000000
+ .quad 0x3fc5aa0b40000000
+ .quad 0x3fc5cfa470000000
+ .quad 0x3fc5f52440000000
+ .quad 0x3fc61a8ad0000000
+ .quad 0x3fc63fd850000000
+ .quad 0x3fc6650cd0000000
+ .quad 0x3fc68a2880000000
+ .quad 0x3fc6af2b80000000
+ .quad 0x3fc6d415e0000000
+ .quad 0x3fc6f8e7d0000000
+ .quad 0x3fc71da170000000
+ .quad 0x3fc74242e0000000
+ .quad 0x3fc766cc40000000
+ .quad 0x3fc78b3da0000000
+ .quad 0x3fc7af9730000000
+ .quad 0x3fc7d3d910000000
+ .quad 0x3fc7f80350000000
+ .quad 0x3fc81c1620000000
+ .quad 0x3fc8401190000000
+ .quad 0x3fc863f5c0000000
+ .quad 0x3fc887c2e0000000
+ .quad 0x3fc8ab7900000000
+ .quad 0x3fc8cf1830000000
+ .quad 0x3fc8f2a0a0000000
+ .quad 0x3fc9161270000000
+ .quad 0x3fc9396db0000000
+ .quad 0x3fc95cb280000000
+ .quad 0x3fc97fe100000000
+ .quad 0x3fc9a2f950000000
+ .quad 0x3fc9c5fb70000000
+ .quad 0x3fc9e8e7b0000000
+ .quad 0x3fca0bbdf0000000
+ .quad 0x3fca2e7e80000000
+ .quad 0x3fca512960000000
+ .quad 0x3fca73bea0000000
+ .quad 0x3fca963e70000000
+ .quad 0x3fcab8a8f0000000
+ .quad 0x3fcadafe20000000
+ .quad 0x3fcafd3e30000000
+ .quad 0x3fcb1f6930000000
+ .quad 0x3fcb417f40000000
+ .quad 0x3fcb638070000000
+ .quad 0x3fcb856cf0000000
+ .quad 0x3fcba744b0000000
+ .quad 0x3fcbc907f0000000
+ .quad 0x3fcbeab6c0000000
+ .quad 0x3fcc0c5130000000
+ .quad 0x3fcc2dd750000000
+ .quad 0x3fcc4f4950000000
+ .quad 0x3fcc70a740000000
+ .quad 0x3fcc91f130000000
+ .quad 0x3fccb32740000000
+ .quad 0x3fccd44980000000
+ .quad 0x3fccf55810000000
+ .quad 0x3fcd165300000000
+ .quad 0x3fcd373a60000000
+ .quad 0x3fcd580e60000000
+ .quad 0x3fcd78cf00000000
+ .quad 0x3fcd997c70000000
+ .quad 0x3fcdba16a0000000
+ .quad 0x3fcdda9dd0000000
+ .quad 0x3fcdfb11f0000000
+ .quad 0x3fce1b7330000000
+ .quad 0x3fce3bc1a0000000
+ .quad 0x3fce5bfd50000000
+ .quad 0x3fce7c2660000000
+ .quad 0x3fce9c3ce0000000
+ .quad 0x3fcebc40e0000000
+ .quad 0x3fcedc3280000000
+ .quad 0x3fcefc11d0000000
+ .quad 0x3fcf1bdee0000000
+ .quad 0x3fcf3b99d0000000
+ .quad 0x3fcf5b42a0000000
+ .quad 0x3fcf7ad980000000
+ .quad 0x3fcf9a5e70000000
+ .quad 0x3fcfb9d190000000
+ .quad 0x3fcfd932f0000000
+ .quad 0x3fcff882a0000000
+ .quad 0x3fd00be050000000
+ .quad 0x3fd01b76a0000000
+ .quad 0x3fd02b0430000000
+ .quad 0x3fd03a8910000000
+ .quad 0x3fd04a0540000000
+ .quad 0x3fd05978e0000000
+ .quad 0x3fd068e3f0000000
+ .quad 0x3fd0784670000000
+ .quad 0x3fd087a080000000
+ .quad 0x3fd096f210000000
+ .quad 0x3fd0a63b30000000
+ .quad 0x3fd0b57bf0000000
+ .quad 0x3fd0c4b450000000
+ .quad 0x3fd0d3e460000000
+ .quad 0x3fd0e30c30000000
+ .quad 0x3fd0f22bc0000000
+ .quad 0x3fd1014310000000
+ .quad 0x3fd1105240000000
+ .quad 0x3fd11f5940000000
+ .quad 0x3fd12e5830000000
+ .quad 0x3fd13d4f00000000
+ .quad 0x3fd14c3dd0000000
+ .quad 0x3fd15b24a0000000
+ .quad 0x3fd16a0370000000
+ .quad 0x3fd178da50000000
+ .quad 0x3fd187a940000000
+ .quad 0x3fd1967060000000
+ .quad 0x3fd1a52fa0000000
+ .quad 0x3fd1b3e710000000
+ .quad 0x3fd1c296c0000000
+ .quad 0x3fd1d13eb0000000
+ .quad 0x3fd1dfdef0000000
+ .quad 0x3fd1ee7770000000
+ .quad 0x3fd1fd0860000000
+ .quad 0x3fd20b91a0000000
+ .quad 0x3fd21a1350000000
+ .quad 0x3fd2288d70000000
+ .quad 0x3fd2370010000000
+ .quad 0x3fd2456b30000000
+ .quad 0x3fd253ced0000000
+ .quad 0x3fd2622b00000000
+ .quad 0x3fd2707fd0000000
+ .quad 0x3fd27ecd40000000
+ .quad 0x3fd28d1360000000
+ .quad 0x3fd29b5220000000
+ .quad 0x3fd2a989a0000000
+ .quad 0x3fd2b7b9e0000000
+ .quad 0x3fd2c5e2e0000000
+ .quad 0x3fd2d404b0000000
+ .quad 0x3fd2e21f50000000
+ .quad 0x3fd2f032c0000000
+ .quad 0x3fd2fe3f20000000
+ .quad 0x3fd30c4470000000
+ .quad 0x3fd31a42b0000000
+ .quad 0x3fd32839e0000000
+ .quad 0x3fd3362a10000000
+ .quad 0x3fd3441350000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db20abc22b2208f
+ .quad 0x3db10f69332e0dd4
+ .quad 0x3dce950de87ed257
+ .quad 0x3dd3f3443b626d69
+ .quad 0x3df45aeaa5363e57
+ .quad 0x3dc443683ce1bf0b
+ .quad 0x3df989cd60c6a511
+ .quad 0x3dfd626f201f2e9f
+ .quad 0x3de94f8bb8dabdcd
+ .quad 0x3e0088d8ef423015
+ .quad 0x3e080413a62b79ad
+ .quad 0x3e059717c0eed3c4
+ .quad 0x3dad4a77add44902
+ .quad 0x3e0e763ff037300e
+ .quad 0x3de162d74706f6c3
+ .quad 0x3e0601cc1f4dbc14
+ .quad 0x3deaf3e051f6e5bf
+ .quad 0x3e097a0b1e1af3eb
+ .quad 0x3dc0a38970c002c7
+ .quad 0x3e102e000057c751
+ .quad 0x3e155b00eecd6e0e
+ .quad 0x3ddf86297003b5af
+ .quad 0x3e1057b9b336a36d
+ .quad 0x3e134bc84a06ea4f
+ .quad 0x3e1643da9ea1bcad
+ .quad 0x3e1d66a7b4f7ea2a
+ .quad 0x3df6b2e038f7fcef
+ .quad 0x3df3e954c670f088
+ .quad 0x3e047209093acab3
+ .quad 0x3e1d708fe7275da7
+ .quad 0x3e1fdf9e7771b9e7
+ .quad 0x3e0827bfa70a0660
+ .quad 0x3e1601cc1f4dbc14
+ .quad 0x3e0637f6106a5e5b
+ .quad 0x3e126a13f17c624b
+ .quad 0x3e093eb2ce80623a
+ .quad 0x3e1430d1e91594de
+ .quad 0x3e1d6b10108fa031
+ .quad 0x3e16879c0bbaf241
+ .quad 0x3dff08015ea6bc2b
+ .quad 0x3e29b63dcdc6676c
+ .quad 0x3e2b022cbcc4ab2c
+ .quad 0x3df917d07ddd6544
+ .quad 0x3e1540605703379e
+ .quad 0x3e0cd18b947a1b60
+ .quad 0x3e17ad65277ca97e
+ .quad 0x3e11884dc59f5fa9
+ .quad 0x3e1711c46006d082
+ .quad 0x3e2f092e3c3108f8
+ .quad 0x3e1714c5e32be13a
+ .quad 0x3e26bba7fd734f9a
+ .quad 0x3dfdf48fb5e08483
+ .quad 0x3e232f9bc74d0b95
+ .quad 0x3df973e848790c13
+ .quad 0x3e1eccbc08c6586e
+ .quad 0x3e2115e9f9524a98
+ .quad 0x3e2f1740593131b8
+ .quad 0x3e1bcf8b25643835
+ .quad 0x3e1f5fa81d8bed80
+ .quad 0x3e244a4df929d9e4
+ .quad 0x3e129820d8220c94
+ .quad 0x3e2a0b489304e309
+ .quad 0x3e1f4d56aba665fe
+ .quad 0x3e210c9019365163
+ .quad 0x3df80f78fe592736
+ .quad 0x3e10528825c81cca
+ .quad 0x3de095537d6d746a
+ .quad 0x3e1827bfa70a0660
+ .quad 0x3e06b0a8ec45933c
+ .quad 0x3e105af81bf5dba9
+ .quad 0x3e17e2fa2655d515
+ .quad 0x3e0d59ecbfaee4bf
+ .quad 0x3e1d8b2fda683fa3
+ .quad 0x3e24b8ddfd3a3737
+ .quad 0x3e13827e61ae1204
+ .quad 0x3e2c8c7b49e90f9f
+ .quad 0x3e29eaf01597591d
+ .quad 0x3e19aaa66e317b36
+ .quad 0x3e2e725609720655
+ .quad 0x3e261c33fc7aac54
+ .quad 0x3e29662bcf61a252
+ .quad 0x3e1843c811c42730
+ .quad 0x3e2064bb0b5acb36
+ .quad 0x3e0a340c842701a4
+ .quad 0x3e1a8e55b58f79d6
+ .quad 0x3de92d219c5e9d9a
+ .quad 0x3e3f63e60d7ffd6a
+ .quad 0x3e2e9b0ed9516314
+ .quad 0x3e2923901962350c
+ .quad 0x3e326f8838785e81
+ .quad 0x3e3b5b6a4caba6af
+ .quad 0x3df0226adc8e761c
+ .quad 0x3e3c4ad7313a1aed
+ .quad 0x3e1564e87c738d17
+ .quad 0x3e338fecf18a6618
+ .quad 0x3e3d929ef5777666
+ .quad 0x3e39483bf08da0b8
+ .quad 0x3e3bdd0eeeaa5826
+ .quad 0x3e39c4dd590237ba
+ .quad 0x3e1af3e9e0ebcac7
+ .quad 0x3e35ce5382270dac
+ .quad 0x3e394f74532ab9ba
+ .quad 0x3e07342795888654
+ .quad 0x3e0c5a000be34bf0
+ .quad 0x3e2711c46006d082
+ .quad 0x3e250025b4ed8cf8
+ .quad 0x3e2ed18bcef2d2a0
+ .quad 0x3e21282e0c0a7554
+ .quad 0x3e0d70f33359a7ca
+ .quad 0x3e2b7f7e13a84025
+ .quad 0x3e33306ec321891e
+ .quad 0x3e3fc7f8038b7550
+ .quad 0x3e3eb0358cd71d64
+ .quad 0x3e3a76c822859474
+ .quad 0x3e3d0ec652de86e3
+ .quad 0x3e2fa4cce08658af
+ .quad 0x3e3b84a2d2c00a9e
+ .quad 0x3e20a5b0f2c25bd1
+ .quad 0x3e3dd660225bf699
+ .quad 0x3e08b10f859bf037
+ .quad 0x3e3e8823b590cbe1
+ .quad 0x3e361311f31e96f6
+ .quad 0x3e2e1f875ca20f9a
+ .quad 0x3e2c95724939b9a5
+ .quad 0x3e3805957a3e58e2
+ .quad 0x3e2ff126ea9f0334
+ .quad 0x3e3953f5598e5609
+ .quad 0x3e36c16ff856c448
+ .quad 0x3e24cb220ff261f4
+ .quad 0x3e35e120d53d53a2
+ .quad 0x3e3a527f6189f256
+ .quad 0x3e3856fcffd49c0f
+ .quad 0x3e300c2e8228d7da
+ .quad 0x3df113d09444dfe0
+ .quad 0x3e2510630eea59a6
+ .quad 0x3e262e780f32d711
+ .quad 0x3ded3ed91a10f8cf
+ .quad 0x3e23654a7e4bcd85
+ .quad 0x3e055b784980ad21
+ .quad 0x3e212f2dd4b16e64
+ .quad 0x3e37c4add939f50c
+ .quad 0x3e281784627180fc
+ .quad 0x3dea5162c7e14961
+ .quad 0x3e310c9019365163
+ .quad 0x3e373c4d2ba17688
+ .quad 0x3e2ae8a5e0e93d81
+ .quad 0x3e2ab0c6f01621af
+ .quad 0x3e301e8b74dd5b66
+ .quad 0x3e2d206fecbb5494
+ .quad 0x3df0b48b724fcc00
+ .quad 0x3e3f831f0b61e229
+ .quad 0x3df81a97c407bcaf
+ .quad 0x3e3e286c1ccbb7aa
+ .quad 0x3e28630b49220a93
+ .quad 0x3dff0b15c1a22c5c
+ .quad 0x3e355445e71c0946
+ .quad 0x3e3be630f8066d85
+ .quad 0x3e2599dff0d96c39
+ .quad 0x3e36cc85b18fb081
+ .quad 0x3e34476d001ea8c8
+ .quad 0x3e373f889e16d31f
+ .quad 0x3e3357100d792a87
+ .quad 0x3e3bd179ae6101f6
+ .quad 0x3e0ca31056c3f6e2
+ .quad 0x3e3d2870629c08fb
+ .quad 0x3e3aba3880d2673f
+ .quad 0x3e2c3633cb297da6
+ .quad 0x3e21843899efea02
+ .quad 0x3e3bccc99d2008e6
+ .quad 0x3e38000544bdd350
+ .quad 0x3e2b91c226606ae1
+ .quad 0x3e2a7adf26b62bdf
+ .quad 0x3e18764fc8826ec9
+ .quad 0x3e1f4f3de50f68f0
+ .quad 0x3df760ca757995e3
+ .quad 0x3dfc667ed3805147
+ .quad 0x3e3733f6196adf6f
+ .quad 0x3e2fb710f33e836b
+ .quad 0x3e39886eba641013
+ .quad 0x3dfb5368d0af8c1a
+ .quad 0x3e358c691b8d2971
+ .quad 0x3dfe9465226d08fb
+ .quad 0x3e33587e063f0097
+ .quad 0x3e3618e702129f18
+ .quad 0x3e361c33fc7aac54
+ .quad 0x3e3f07a68408604a
+ .quad 0x3e3c34bfe4945421
+ .quad 0x3e38b1f00e41300b
+ .quad 0x3e3f434284d61b63
+ .quad 0x3e3a63095e397436
+ .quad 0x3e34428656b919de
+ .quad 0x3e36ca9201b2d9a6
+ .quad 0x3e2738823a2a931c
+ .quad 0x3e3c11880e179230
+ .quad 0x3e313ddc8d6d52fe
+ .quad 0x3e33eed58922e917
+ .quad 0x3e295992846bdd50
+ .quad 0x3e0ddb4d5f2e278b
+ .quad 0x3df1a5f12a0635c4
+ .quad 0x3e4642f0882c3c34
+ .quad 0x3e2aee9ba7f6475e
+ .quad 0x3e264b7f834a60e4
+ .quad 0x3e290d42e243792e
+ .quad 0x3e4c272008134f01
+ .quad 0x3e4a782e16d6cf5b
+ .quad 0x3e44505c79da6648
+ .quad 0x3e4ca9d4ea4dcd21
+ .quad 0x3e297d3d627cd5bc
+ .quad 0x3e20b15cf9bcaa13
+ .quad 0x3e315b2063cf76dd
+ .quad 0x3e2983e6f3aa2748
+ .quad 0x3e3f4c64f4ffe994
+ .quad 0x3e46beba7ce85a0f
+ .quad 0x3e3b9c69fd4ea6b8
+ .quad 0x3e2b6aa5835fa4ab
+ .quad 0x3e43ccc3790fedd1
+ .quad 0x3e29c04cc4404fe0
+ .quad 0x3e40734b7a75d89d
+ .quad 0x3e1b4404c4e01612
+ .quad 0x3e40c565c2ce4894
+ .quad 0x3e33c71441d935cd
+ .quad 0x3d72a492556b3b4e
+ .quad 0x3e20fa090341dc43
+ .quad 0x3e2e8f7009e3d9f4
+ .quad 0x3e4b1bf68b048a45
+ .quad 0x3e3eee52dffaa956
+ .quad 0x3e456b0900e465bd
+ .quad 0x3e4d929ef5777666
+ .quad 0x3e486ea28637e260
+ .quad 0x3e4665aff10ca2f0
+ .quad 0x3e2f11fdaf48ec74
+ .quad 0x3e4cbe1b86a4d1c7
+ .quad 0x3e25b05bfea87665
+ .quad 0x3e41cec20a1a4a1d
+ .quad 0x3e41cd5f0a409b9f
+ .quad 0x3e453656c8265070
+ .quad 0x3e377ed835282260
+ .quad 0x3e2417bc3040b9d2
+ .quad 0x3e408eef7b79eff2
+ .quad 0x3e4dc76f39dc57e9
+ .quad 0x3e4c0493a70cf457
+ .quad 0x3e4a83d6cea5a60c
+ .quad 0x3e30d6700dc557ba
+ .quad 0x3e44c96c12e8bd0a
+ .quad 0x3e3d2c1993e32315
+ .quad 0x3e22c721135f8242
+ .quad 0x3e279a3e4dda747d
+ .quad 0x3dfcf89f6941a72b
+ .quad 0x3e2149a702f10831
+ .quad 0x3e4ead4b7c8175db
+ .quad 0x3e4e6930fe63e70a
+ .quad 0x3e41e106bed9ee2f
+ .quad 0x3e2d682b82f11c92
+ .quad 0x3e3a07f188dba47c
+ .quad 0x3e40f9342dc172f6
+ .quad 0x3e03ef3fde623e25
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+
diff --git a/src/gas/log10f.S b/src/gas/log10f.S
new file mode 100644
index 0000000..eb89c6c
--- /dev/null
+++ b/src/gas/log10f.S
@@ -0,0 +1,745 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log10f.S
+#
+# An implementation of the log10f libm function.
+#
+# Prototype:
+#
+# float log10f(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
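+# Single-precision variant: the reduction mirrors log10.S but indexes the
+# tables on a 1/128 grid (top 7 mantissa bits, rounded via the 8th bit) and
+# uses the shorter polynomial r + r^2/2 + r^3/3 before the log10(e) scaling.
+#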
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log10f)
+#define fname_special _log10f_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+ movss .L__real_log10_2_tail(%rip), %xmm3
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ mulss %xmm5, %xmm3
+ addss %xmm2, %xmm1
+
+ mulss .L__real_log10_e(%rip), %xmm1
+
+ # m*log10(2) + log10(G) - poly
+ movss .L__real_log10_2_lead(%rip), %xmm0
+ subss %xmm1, %xmm3 # z2
+ mulss %xmm5, %xmm0
+ addss (%r9,%rax,4), %xmm3
+ addss (%r10,%rax,4), %xmm0
+
+ addss %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ movdqa %xmm0, %xmm5
+ pand .L__mask_lower(%rip), %xmm5
+ subss %xmm5, %xmm0
+ addss %xmm0, %xmm2
+
+ movss %xmm5, %xmm0
+ movss %xmm2, %xmm1
+
+ mulss .L__real_log10_e_tail(%rip), %xmm2
+ mulss .L__real_log10_e_tail(%rip), %xmm0
+ mulss .L__real_log10_e_lead(%rip), %xmm1
+ mulss .L__real_log10_e_lead(%rip), %xmm5
+
+ addss %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10_e_lead: .quad 0x3EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x3EDE00003EDE0000
+.L__real_log10_e_tail: .quad 0x3A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x3A37B1523A37B152
+
+.L__real_log10_2_lead: .quad 0x3e9a00003e9a0000
+ .quad 0x0000000000000000
+.L__real_log10_2_tail: .quad 0x39826a1339826a13
+ .quad 0x0000000000000000
+.L__real_log10_e: .quad 0x3ede5bd93ede5bd9
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000
+ .quad 0x0ffff0000ffff0000
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3b5d4000
+ .long 0x3bdc8000
+ .long 0x3c24c000
+ .long 0x3c5ac000
+ .long 0x3c884000
+ .long 0x3ca2c000
+ .long 0x3cbd4000
+ .long 0x3cd78000
+ .long 0x3cf1c000
+ .long 0x3d05c000
+ .long 0x3d128000
+ .long 0x3d1f4000
+ .long 0x3d2c0000
+ .long 0x3d388000
+ .long 0x3d450000
+ .long 0x3d518000
+ .long 0x3d5dc000
+ .long 0x3d6a0000
+ .long 0x3d760000
+ .long 0x3d810000
+ .long 0x3d870000
+ .long 0x3d8d0000
+ .long 0x3d92c000
+ .long 0x3d98c000
+ .long 0x3d9e8000
+ .long 0x3da44000
+ .long 0x3daa0000
+ .long 0x3dafc000
+ .long 0x3db58000
+ .long 0x3dbb4000
+ .long 0x3dc0c000
+ .long 0x3dc64000
+ .long 0x3dcc0000
+ .long 0x3dd18000
+ .long 0x3dd6c000
+ .long 0x3ddc4000
+ .long 0x3de1c000
+ .long 0x3de70000
+ .long 0x3dec8000
+ .long 0x3df1c000
+ .long 0x3df70000
+ .long 0x3dfc4000
+ .long 0x3e00c000
+ .long 0x3e034000
+ .long 0x3e05c000
+ .long 0x3e088000
+ .long 0x3e0b0000
+ .long 0x3e0d8000
+ .long 0x3e100000
+ .long 0x3e128000
+ .long 0x3e150000
+ .long 0x3e178000
+ .long 0x3e1a0000
+ .long 0x3e1c8000
+ .long 0x3e1ec000
+ .long 0x3e214000
+ .long 0x3e23c000
+ .long 0x3e260000
+ .long 0x3e288000
+ .long 0x3e2ac000
+ .long 0x3e2d4000
+ .long 0x3e2f8000
+ .long 0x3e31c000
+ .long 0x3e344000
+ .long 0x3e368000
+ .long 0x3e38c000
+ .long 0x3e3b0000
+ .long 0x3e3d4000
+ .long 0x3e3fc000
+ .long 0x3e420000
+ .long 0x3e440000
+ .long 0x3e464000
+ .long 0x3e488000
+ .long 0x3e4ac000
+ .long 0x3e4d0000
+ .long 0x3e4f4000
+ .long 0x3e514000
+ .long 0x3e538000
+ .long 0x3e55c000
+ .long 0x3e57c000
+ .long 0x3e5a0000
+ .long 0x3e5c0000
+ .long 0x3e5e4000
+ .long 0x3e604000
+ .long 0x3e624000
+ .long 0x3e648000
+ .long 0x3e668000
+ .long 0x3e688000
+ .long 0x3e6ac000
+ .long 0x3e6cc000
+ .long 0x3e6ec000
+ .long 0x3e70c000
+ .long 0x3e72c000
+ .long 0x3e74c000
+ .long 0x3e76c000
+ .long 0x3e78c000
+ .long 0x3e7ac000
+ .long 0x3e7cc000
+ .long 0x3e7ec000
+ .long 0x3e804000
+ .long 0x3e814000
+ .long 0x3e824000
+ .long 0x3e834000
+ .long 0x3e840000
+ .long 0x3e850000
+ .long 0x3e860000
+ .long 0x3e870000
+ .long 0x3e880000
+ .long 0x3e88c000
+ .long 0x3e89c000
+ .long 0x3e8ac000
+ .long 0x3e8bc000
+ .long 0x3e8c8000
+ .long 0x3e8d8000
+ .long 0x3e8e8000
+ .long 0x3e8f4000
+ .long 0x3e904000
+ .long 0x3e914000
+ .long 0x3e920000
+ .long 0x3e930000
+ .long 0x3e93c000
+ .long 0x3e94c000
+ .long 0x3e958000
+ .long 0x3e968000
+ .long 0x3e978000
+ .long 0x3e984000
+ .long 0x3e994000
+ .long 0x3e9a0000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x367a8e44
+ .long 0x368ed49f
+ .long 0x36c21451
+ .long 0x375211d6
+ .long 0x3720ea11
+ .long 0x37e9eb59
+ .long 0x37b87be7
+ .long 0x37bf2560
+ .long 0x33d597a0
+ .long 0x37806a05
+ .long 0x3820581f
+ .long 0x38223334
+ .long 0x378e3bac
+ .long 0x3810684f
+ .long 0x37feb7ae
+ .long 0x36a9d609
+ .long 0x37a68163
+ .long 0x376a8b27
+ .long 0x384c8fd6
+ .long 0x3885183e
+ .long 0x3874a760
+ .long 0x380d1154
+ .long 0x38ea42bd
+ .long 0x384c1571
+ .long 0x38ba66b8
+ .long 0x38e7da3b
+ .long 0x38eee632
+ .long 0x38d00911
+ .long 0x388bbede
+ .long 0x378a0512
+ .long 0x3894c7a0
+ .long 0x38e30710
+ .long 0x36db2829
+ .long 0x3729d609
+ .long 0x38fa0e82
+ .long 0x38bc9a75
+ .long 0x383a9297
+ .long 0x38dc83c8
+ .long 0x37eac335
+ .long 0x38706ac3
+ .long 0x389574c2
+ .long 0x3892d068
+ .long 0x38615032
+ .long 0x3917acf4
+ .long 0x3967a126
+ .long 0x38217840
+ .long 0x38b420ab
+ .long 0x38f9c7b2
+ .long 0x391103bd
+ .long 0x39169a6b
+ .long 0x390dd194
+ .long 0x38eda471
+ .long 0x38a38950
+ .long 0x37f6844a
+ .long 0x395e1cdb
+ .long 0x390fcffc
+ .long 0x38503e9d
+ .long 0x394b00fd
+ .long 0x38a9910a
+ .long 0x39518a31
+ .long 0x3882d2c2
+ .long 0x392488e4
+ .long 0x397b0aff
+ .long 0x388a22d8
+ .long 0x3902bd5e
+ .long 0x39342f85
+ .long 0x39598811
+ .long 0x3972e6b1
+ .long 0x34d53654
+ .long 0x360ca25e
+ .long 0x39785cc0
+ .long 0x39630710
+ .long 0x39424ed7
+ .long 0x39165101
+ .long 0x38be5421
+ .long 0x37e7b0c0
+ .long 0x394fd0c3
+ .long 0x38efaaaa
+ .long 0x37a8f566
+ .long 0x3927c744
+ .long 0x383fa4d5
+ .long 0x392d9e39
+ .long 0x3803feae
+ .long 0x390a268c
+ .long 0x39692b80
+ .long 0x38789b4f
+ .long 0x3909307d
+ .long 0x394a601c
+ .long 0x35e67edc
+ .long 0x383e386d
+ .long 0x38a7743d
+ .long 0x38dccec3
+ .long 0x38ff57e0
+ .long 0x39079d8b
+ .long 0x390651a6
+ .long 0x38f7bad9
+ .long 0x38d0ab82
+ .long 0x38979e7d
+ .long 0x381978ee
+ .long 0x397816c8
+ .long 0x39410cb2
+ .long 0x39015384
+ .long 0x3863fa28
+ .long 0x39f41065
+ .long 0x39c7668a
+ .long 0x39968afa
+ .long 0x39430db9
+ .long 0x38a18cf3
+ .long 0x39eb2907
+ .long 0x39a9e10c
+ .long 0x39492800
+ .long 0x385a53d1
+ .long 0x39ce0cf7
+ .long 0x3979c7b2
+ .long 0x389f5d99
+ .long 0x39ceefcb
+ .long 0x39646a39
+ .long 0x380d7a9b
+ .long 0x39ad6650
+ .long 0x390ac3b8
+ .long 0x39d9a9a8
+ .long 0x39548a99
+ .long 0x39f73c4b
+ .long 0x3980960e
+ .long 0x374b3d5a
+ .long 0x39888f1e
+ .long 0x37679a07
+ .long 0x39826a13
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/log2.S b/src/gas/log2.S
new file mode 100644
index 0000000..0c791b5
--- /dev/null
+++ b/src/gas/log2.S
@@ -0,0 +1,1132 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log2.S
+#
+# An implementation of the log2 libm function.
+#
+# Prototype:
+#
+# double log2(double x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
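+# Same table-based reduction as in log10.S; the reconstruction here is
+#   log2(x) ~ m + log2(F) - (r + r^2/2 + ... + r^6/6)*log2(e)
+# with log2(F) from .L__log_256_lead/_tail and log2(e) split into lead/tail
+# parts on the near-one codepath.
+#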
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log2)
+#define fname_special _log2_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+
+ mulsd .L__real_log2_e(%rip), %xmm1
+
+ # m + log2(G) - poly*log2_e
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ subsd %xmm1, %xmm2
+
+ addsd %xmm6, %xmm0
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
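+ # same 2*atanh(u/2)-style series as the log10.S near-one path, with the
+ # final scaling by log2(e) split into lead and tail parts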
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+
+ movdqa %xmm0, %xmm3
+ pand .L__mask_lower(%rip), %xmm3
+ subsd %xmm3, %xmm0
+ addsd %xmm0, %xmm4
+
+ movsd %xmm3, %xmm0
+ movsd %xmm4, %xmm1
+
+ mulsd .L__real_log2_e_tail(%rip), %xmm4
+ mulsd .L__real_log2_e_tail(%rip), %xmm0
+ mulsd .L__real_log2_e_lead(%rip), %xmm1
+ mulsd .L__real_log2_e_lead(%rip), %xmm3
+
+ addsd %xmm4, %xmm0
+ addsd %xmm1, %xmm0
+ addsd %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log2_e: .quad 0x3ff71547652b82fe
+ .quad 0x0000000000000000
+
+.L__real_log2_e_lead: .quad 0x3ff7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x0000000000000000
+.L__real_log2_e_tail: .quad 0x3ecb295c17f0bbbe # log2e_tail 3.23791044778235969970E-06
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f7709c460000000
+ .quad 0x3f86fe50b0000000
+ .quad 0x3f91363110000000
+ .quad 0x3f96e79680000000
+ .quad 0x3f9c9363b0000000
+ .quad 0x3fa11cd1d0000000
+ .quad 0x3fa3ed3090000000
+ .quad 0x3fa6bad370000000
+ .quad 0x3fa985bfc0000000
+ .quad 0x3fac4dfab0000000
+ .quad 0x3faf138980000000
+ .quad 0x3fb0eb3890000000
+ .quad 0x3fb24b5b70000000
+ .quad 0x3fb3aa2fd0000000
+ .quad 0x3fb507b830000000
+ .quad 0x3fb663f6f0000000
+ .quad 0x3fb7beee90000000
+ .quad 0x3fb918a160000000
+ .quad 0x3fba7111d0000000
+ .quad 0x3fbbc84240000000
+ .quad 0x3fbd1e34e0000000
+ .quad 0x3fbe72ec10000000
+ .quad 0x3fbfc66a00000000
+ .quad 0x3fc08c5880000000
+ .quad 0x3fc134e1b0000000
+ .quad 0x3fc1dcd190000000
+ .quad 0x3fc2842940000000
+ .quad 0x3fc32ae9e0000000
+ .quad 0x3fc3d11460000000
+ .quad 0x3fc476a9f0000000
+ .quad 0x3fc51bab90000000
+ .quad 0x3fc5c01a30000000
+ .quad 0x3fc663f6f0000000
+ .quad 0x3fc70742d0000000
+ .quad 0x3fc7a9fec0000000
+ .quad 0x3fc84c2bd0000000
+ .quad 0x3fc8edcae0000000
+ .quad 0x3fc98edd00000000
+ .quad 0x3fca2f6320000000
+ .quad 0x3fcacf5e20000000
+ .quad 0x3fcb6ecf10000000
+ .quad 0x3fcc0db6c0000000
+ .quad 0x3fccac1630000000
+ .quad 0x3fcd49ee40000000
+ .quad 0x3fcde73fe0000000
+ .quad 0x3fce840be0000000
+ .quad 0x3fcf205330000000
+ .quad 0x3fcfbc16b0000000
+ .quad 0x3fd02baba0000000
+ .quad 0x3fd0790ad0000000
+ .quad 0x3fd0c62970000000
+ .quad 0x3fd11307d0000000
+ .quad 0x3fd15fa670000000
+ .quad 0x3fd1ac05b0000000
+ .quad 0x3fd1f825f0000000
+ .quad 0x3fd24407a0000000
+ .quad 0x3fd28fab30000000
+ .quad 0x3fd2db10f0000000
+ .quad 0x3fd3263960000000
+ .quad 0x3fd37124c0000000
+ .quad 0x3fd3bbd3a0000000
+ .quad 0x3fd4064630000000
+ .quad 0x3fd4507cf0000000
+ .quad 0x3fd49a7840000000
+ .quad 0x3fd4e43880000000
+ .quad 0x3fd52dbdf0000000
+ .quad 0x3fd5770910000000
+ .quad 0x3fd5c01a30000000
+ .quad 0x3fd608f1b0000000
+ .quad 0x3fd6518fe0000000
+ .quad 0x3fd699f520000000
+ .quad 0x3fd6e221c0000000
+ .quad 0x3fd72a1630000000
+ .quad 0x3fd771d2b0000000
+ .quad 0x3fd7b957a0000000
+ .quad 0x3fd800a560000000
+ .quad 0x3fd847bc30000000
+ .quad 0x3fd88e9c70000000
+ .quad 0x3fd8d54670000000
+ .quad 0x3fd91bba80000000
+ .quad 0x3fd961f900000000
+ .quad 0x3fd9a80230000000
+ .quad 0x3fd9edd670000000
+ .quad 0x3fda337600000000
+ .quad 0x3fda78e140000000
+ .quad 0x3fdabe1870000000
+ .quad 0x3fdb031be0000000
+ .quad 0x3fdb47ebf0000000
+ .quad 0x3fdb8c88d0000000
+ .quad 0x3fdbd0f2e0000000
+ .quad 0x3fdc152a60000000
+ .quad 0x3fdc592fa0000000
+ .quad 0x3fdc9d02f0000000
+ .quad 0x3fdce0a490000000
+ .quad 0x3fdd2414c0000000
+ .quad 0x3fdd6753e0000000
+ .quad 0x3fddaa6220000000
+ .quad 0x3fdded3fd0000000
+ .quad 0x3fde2fed30000000
+ .quad 0x3fde726aa0000000
+ .quad 0x3fdeb4b840000000
+ .quad 0x3fdef6d670000000
+ .quad 0x3fdf38c560000000
+ .quad 0x3fdf7a8560000000
+ .quad 0x3fdfbc16b0000000
+ .quad 0x3fdffd7990000000
+ .quad 0x3fe01f5720000000
+ .quad 0x3fe03fda80000000
+ .quad 0x3fe0604710000000
+ .quad 0x3fe0809cf0000000
+ .quad 0x3fe0a0dc30000000
+ .quad 0x3fe0c10500000000
+ .quad 0x3fe0e11770000000
+ .quad 0x3fe10113b0000000
+ .quad 0x3fe120f9d0000000
+ .quad 0x3fe140c9f0000000
+ .quad 0x3fe1608440000000
+ .quad 0x3fe18028c0000000
+ .quad 0x3fe19fb7b0000000
+ .quad 0x3fe1bf3110000000
+ .quad 0x3fe1de9510000000
+ .quad 0x3fe1fde3d0000000
+ .quad 0x3fe21d1d50000000
+ .quad 0x3fe23c41d0000000
+ .quad 0x3fe25b5150000000
+ .quad 0x3fe27a4c00000000
+ .quad 0x3fe29931f0000000
+ .quad 0x3fe2b80340000000
+ .quad 0x3fe2d6c010000000
+ .quad 0x3fe2f56870000000
+ .quad 0x3fe313fc80000000
+ .quad 0x3fe3327c60000000
+ .quad 0x3fe350e830000000
+ .quad 0x3fe36f3ff0000000
+ .quad 0x3fe38d83e0000000
+ .quad 0x3fe3abb3f0000000
+ .quad 0x3fe3c9d060000000
+ .quad 0x3fe3e7d930000000
+ .quad 0x3fe405ce80000000
+ .quad 0x3fe423b070000000
+ .quad 0x3fe4417f20000000
+ .quad 0x3fe45f3a90000000
+ .quad 0x3fe47ce2f0000000
+ .quad 0x3fe49a7840000000
+ .quad 0x3fe4b7fab0000000
+ .quad 0x3fe4d56a50000000
+ .quad 0x3fe4f2c740000000
+ .quad 0x3fe5101180000000
+ .quad 0x3fe52d4940000000
+ .quad 0x3fe54a6e80000000
+ .quad 0x3fe5678170000000
+ .quad 0x3fe5848220000000
+ .quad 0x3fe5a170a0000000
+ .quad 0x3fe5be4d00000000
+ .quad 0x3fe5db1770000000
+ .quad 0x3fe5f7cff0000000
+ .quad 0x3fe61476a0000000
+ .quad 0x3fe6310b80000000
+ .quad 0x3fe64d8ed0000000
+ .quad 0x3fe66a0080000000
+ .quad 0x3fe68660c0000000
+ .quad 0x3fe6a2af90000000
+ .quad 0x3fe6beed20000000
+ .quad 0x3fe6db1960000000
+ .quad 0x3fe6f73480000000
+ .quad 0x3fe7133e90000000
+ .quad 0x3fe72f37a0000000
+ .quad 0x3fe74b1fd0000000
+ .quad 0x3fe766f720000000
+ .quad 0x3fe782bdb0000000
+ .quad 0x3fe79e73a0000000
+ .quad 0x3fe7ba18f0000000
+ .quad 0x3fe7d5adc0000000
+ .quad 0x3fe7f13220000000
+ .quad 0x3fe80ca620000000
+ .quad 0x3fe82809d0000000
+ .quad 0x3fe8435d50000000
+ .quad 0x3fe85ea0b0000000
+ .quad 0x3fe879d3f0000000
+ .quad 0x3fe894f740000000
+ .quad 0x3fe8b00aa0000000
+ .quad 0x3fe8cb0e30000000
+ .quad 0x3fe8e60200000000
+ .quad 0x3fe900e610000000
+ .quad 0x3fe91bba80000000
+ .quad 0x3fe9367f60000000
+ .quad 0x3fe95134d0000000
+ .quad 0x3fe96bdad0000000
+ .quad 0x3fe9867170000000
+ .quad 0x3fe9a0f8d0000000
+ .quad 0x3fe9bb70f0000000
+ .quad 0x3fe9d5d9f0000000
+ .quad 0x3fe9f033e0000000
+ .quad 0x3fea0a7ed0000000
+ .quad 0x3fea24bad0000000
+ .quad 0x3fea3ee7f0000000
+ .quad 0x3fea590640000000
+ .quad 0x3fea7315d0000000
+ .quad 0x3fea8d16b0000000
+ .quad 0x3feaa708f0000000
+ .quad 0x3feac0eca0000000
+ .quad 0x3feadac1e0000000
+ .quad 0x3feaf488b0000000
+ .quad 0x3feb0e4120000000
+ .quad 0x3feb27eb40000000
+ .quad 0x3feb418730000000
+ .quad 0x3feb5b14f0000000
+ .quad 0x3feb749480000000
+ .quad 0x3feb8e0620000000
+ .quad 0x3feba769b0000000
+ .quad 0x3febc0bf50000000
+ .quad 0x3febda0710000000
+ .quad 0x3febf34110000000
+ .quad 0x3fec0c6d40000000
+ .quad 0x3fec258bc0000000
+ .quad 0x3fec3e9ca0000000
+ .quad 0x3fec579fe0000000
+ .quad 0x3fec7095a0000000
+ .quad 0x3fec897df0000000
+ .quad 0x3feca258d0000000
+ .quad 0x3fecbb2660000000
+ .quad 0x3fecd3e6a0000000
+ .quad 0x3fecec9990000000
+ .quad 0x3fed053f60000000
+ .quad 0x3fed1dd810000000
+ .quad 0x3fed3663b0000000
+ .quad 0x3fed4ee240000000
+ .quad 0x3fed6753e0000000
+ .quad 0x3fed7fb890000000
+ .quad 0x3fed981060000000
+ .quad 0x3fedb05b60000000
+ .quad 0x3fedc899a0000000
+ .quad 0x3fede0cb30000000
+ .quad 0x3fedf8f020000000
+ .quad 0x3fee110860000000
+ .quad 0x3fee291420000000
+ .quad 0x3fee411360000000
+ .quad 0x3fee590630000000
+ .quad 0x3fee70eca0000000
+ .quad 0x3fee88c6b0000000
+ .quad 0x3feea09470000000
+ .quad 0x3feeb855f0000000
+ .quad 0x3feed00b40000000
+ .quad 0x3feee7b470000000
+ .quad 0x3feeff5180000000
+ .quad 0x3fef16e280000000
+ .quad 0x3fef2e6780000000
+ .quad 0x3fef45e080000000
+ .quad 0x3fef5d4da0000000
+ .quad 0x3fef74aef0000000
+ .quad 0x3fef8c0460000000
+ .quad 0x3fefa34e10000000
+ .quad 0x3fefba8c00000000
+ .quad 0x3fefd1be40000000
+ .quad 0x3fefe8e4f0000000
+ .quad 0x3ff0000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3deaf558ee95b37a
+ .quad 0x3debbc2145fe38de
+ .quad 0x3dfea5ec312ed069
+ .quad 0x3df70b48a629b89f
+ .quad 0x3e050a1f0cccdd01
+ .quad 0x3e044cd04bb60514
+ .quad 0x3e01a16898809d2d
+ .quad 0x3e063bf61cc4d81b
+ .quad 0x3dfa4a8ca305071d
+ .quad 0x3e121556bde9f1f0
+ .quad 0x3df9929cfd0e6835
+ .quad 0x3e2f453f35679ee9
+ .quad 0x3e2c26b47913459e
+ .quad 0x3e2a4fe385b009a2
+ .quad 0x3e180ceedb53cb4d
+ .quad 0x3e2592262cf998a7
+ .quad 0x3e1ae28a04f106b8
+ .quad 0x3e2c8c66b55ce464
+ .quad 0x3e2e690927d688b0
+ .quad 0x3de5b5774c7658b4
+ .quad 0x3e0adc16d26859c7
+ .quad 0x3df7fa5b21cbdb5d
+ .quad 0x3e2e160149209a68
+ .quad 0x3e39b4f3c72c4f78
+ .quad 0x3e222418b7fcd690
+ .quad 0x3e2d54aded7a9150
+ .quad 0x3e360f4c7f1aed15
+ .quad 0x3e13c570d0fa8f96
+ .quad 0x3e3b3514c7e0166e
+ .quad 0x3e3307ee9a6271d2
+ .quad 0x3dee9722922c0226
+ .quad 0x3e33f7ad0f3f4016
+ .quad 0x3e3592262cf998a7
+ .quad 0x3e23bc09fca70073
+ .quad 0x3e2f41777bc5f936
+ .quad 0x3dd781d97ee91247
+ .quad 0x3e306a56d76b9a84
+ .quad 0x3e2df9c37c0beb3a
+ .quad 0x3e1905c35651c429
+ .quad 0x3e3b69d927dfc23d
+ .quad 0x3e2d7e57a5afb633
+ .quad 0x3e3bb29bdc81c4db
+ .quad 0x3e38ee1b912d8994
+ .quad 0x3e3864b2df91e96a
+ .quad 0x3e1d8a40770df213
+ .quad 0x3e2d39a9331f27cf
+ .quad 0x3e32411e4e8eea54
+ .quad 0x3e3204d0144751b3
+ .quad 0x3e2268331dd8bd0b
+ .quad 0x3e47606012de0634
+ .quad 0x3e3550aa3a93ec7e
+ .quad 0x3e45a616eb9612e0
+ .quad 0x3e3aec23fd65f8e1
+ .quad 0x3e248f838294639c
+ .quad 0x3e3b62384cafa1a3
+ .quad 0x3e461c0e73048b72
+ .quad 0x3e36cc9a0d8c0e85
+ .quad 0x3e489b355ede26f4
+ .quad 0x3e2b5941acd71f1e
+ .quad 0x3e4d499bd9b32266
+ .quad 0x3e043b9f52b061ba
+ .quad 0x3e46360892eb65e6
+ .quad 0x3e4dba9f8729ab41
+ .quad 0x3e479a3715fc9257
+ .quad 0x3e0d1f6d3f77ae38
+ .quad 0x3e48992d66fb9ec1
+ .quad 0x3e4666f195620f03
+ .quad 0x3e43f7ad0f3f4016
+ .quad 0x3e30a522b65bc039
+ .quad 0x3e319dee9b9489e3
+ .quad 0x3e323352e1a31521
+ .quad 0x3e4b3a19bcaf1aa4
+ .quad 0x3e3f2f060a50d366
+ .quad 0x3e44fdf677c8dfd9
+ .quad 0x3e48a35588aec6df
+ .quad 0x3e28b0e2a19575b0
+ .quad 0x3e2ec30c6e3e04a7
+ .quad 0x3e2705912d25b325
+ .quad 0x3e2dae1b8d59e849
+ .quad 0x3e423e2e1169656a
+ .quad 0x3e349d026e33d675
+ .quad 0x3e423c465e6976da
+ .quad 0x3e366c977e236c73
+ .quad 0x3e44fec0a13af881
+ .quad 0x3e3bdefbd14a0816
+ .quad 0x3e42fe3e91c348e4
+ .quad 0x3e4fc0c868ccc02d
+ .quad 0x3e3ce20a829051bb
+ .quad 0x3e47f10cf32e6bba
+ .quad 0x3e43cf2061568859
+ .quad 0x3e484995cb804b94
+ .quad 0x3e4a52b6acfcfdca
+ .quad 0x3e3b291ecf4dff1e
+ .quad 0x3e21d2c3e64ae851
+ .quad 0x3e4017e4faa42b7d
+ .quad 0x3de975077f1f5f0c
+ .quad 0x3e20327dc8093a52
+ .quad 0x3e3108d9313aec65
+ .quad 0x3e4a12e5301be44a
+ .quad 0x3e1e754d20c519e1
+ .quad 0x3e3f456f394f9727
+ .quad 0x3e29471103e8f00d
+ .quad 0x3e3ef3150343f8df
+ .quad 0x3e41960d9d9c3263
+ .quad 0x3e4204d0144751b3
+ .quad 0x3e4507ff357398fe
+ .quad 0x3e4dc9937fc8cafd
+ .quad 0x3e572f32fe672868
+ .quad 0x3e53e49d647d323e
+ .quad 0x3e33fb81ea92d9e0
+ .quad 0x3e43e387ef003635
+ .quad 0x3e1ac754cb104aea
+ .quad 0x3e4535f0444ebaaf
+ .quad 0x3e253c8ea7b1cdda
+ .quad 0x3e3cf0c0396a568b
+ .quad 0x3e5543ca873c2b4a
+ .quad 0x3e425780181e2b37
+ .quad 0x3e5ee52ed49d71d2
+ .quad 0x3e51e64842e2c386
+ .quad 0x3e5d2ba01bc76a27
+ .quad 0x3e5b39774c30f499
+ .quad 0x3e38740932120aea
+ .quad 0x3e576dab3462a1e8
+ .quad 0x3e409c9f20203b31
+ .quad 0x3e516e7a08ad0d1a
+ .quad 0x3e46172fe015e13b
+ .quad 0x3e49e4558147cf67
+ .quad 0x3e4cfdeb43cfd005
+ .quad 0x3e3a809c03254a71
+ .quad 0x3e47acfc98509e33
+ .quad 0x3e54366de473e474
+ .quad 0x3e5569394d90d724
+ .quad 0x3e32b83ec743664c
+ .quad 0x3e56db22c4808ee5
+ .quad 0x3df7ae84940df0e1
+ .quad 0x3e554042cd999564
+ .quad 0x3e4242b8488b3056
+ .quad 0x3e4e7dc059ab8a9e
+ .quad 0x3e5a71e977d7da5f
+ .quad 0x3e5d30d552ce0ec3
+ .quad 0x3e43208592b6c6b7
+ .quad 0x3e51440e7149afff
+ .quad 0x3e36812c371a1c87
+ .quad 0x3e579a3715fc9257
+ .quad 0x3e57c92f2af8b0ca
+ .quad 0x3e56679d8894dbdf
+ .quad 0x3e2a9f33e77507f0
+ .quad 0x3e4c22a3e377a524
+ .quad 0x3e3723c84a77a4dc
+ .quad 0x3e594a871b636194
+ .quad 0x3e570d6058f62f4d
+ .quad 0x3e4a6274cf0e362f
+ .quad 0x3e42fe3570af1a0b
+ .quad 0x3e596a286955d67e
+ .quad 0x3e442104f127091e
+ .quad 0x3e407826bae32c6b
+ .quad 0x3df8d8844ce77237
+ .quad 0x3e5eaa609080d4b4
+ .quad 0x3e4dc66fbe61efc4
+ .quad 0x3e5c8f11979a5db6
+ .quad 0x3e52dedf0e6f1770
+ .quad 0x3e5cb41e1410132a
+ .quad 0x3e32866d705c553d
+ .quad 0x3e54ec3293b2fbe0
+ .quad 0x3e578b8c2f4d0fe1
+ .quad 0x3e562ad8f7ca2cff
+ .quad 0x3e5a298b5f973a2c
+ .quad 0x3e49381d4f1b95e0
+ .quad 0x3e564c7bdb9bc56c
+ .quad 0x3e5fbb4caef790fc
+ .quad 0x3e51200c3f899927
+ .quad 0x3e526a05c813d56e
+ .quad 0x3e4681e2910108ee
+ .quad 0x3e282cf15d12ecd7
+ .quad 0x3e0a537e32446892
+ .quad 0x3e46f9c1cb6f7010
+ .quad 0x3e4328ddcedf39d8
+ .quad 0x3e164f64c210df9d
+ .quad 0x3e58f676e17cc811
+ .quad 0x3e560ddf1680dd45
+ .quad 0x3e5e2da951c2d91b
+ .quad 0x3e5696777b66d115
+ .quad 0x3e311eb3043f5601
+ .quad 0x3e48000b33f90fd4
+ .quad 0x3e523e2e1169656a
+ .quad 0x3e5b41565d3990cb
+ .quad 0x3e46138b8d9d31e6
+ .quad 0x3e3565afaf7f6248
+ .quad 0x3e4b68e0ba153594
+ .quad 0x3e3d87027ef4ab9a
+ .quad 0x3e556b9c99085939
+ .quad 0x3e5aa02166cccab2
+ .quad 0x3e5991d2aca399a1
+ .quad 0x3e54982259cc625d
+ .quad 0x3e4b9feddaab9820
+ .quad 0x3e3c70c0f683cc68
+ .quad 0x3e213156425e67e5
+ .quad 0x3df79063deab051f
+ .quad 0x3e27e2744b2b8ca5
+ .quad 0x3e4600534df378df
+ .quad 0x3e59322676507a79
+ .quad 0x3e4c4720cb4558b5
+ .quad 0x3e445e4b56add63a
+ .quad 0x3e4af321af5e9bb5
+ .quad 0x3e57f1e1148dad64
+ .quad 0x3e42a4022f65e2e6
+ .quad 0x3e11f2ccbcd0d3cc
+ .quad 0x3e5eaa65b49696e2
+ .quad 0x3e110e6123a74764
+ .quad 0x3e3cf24b2077c3f6
+ .quad 0x3e4fc8d8164754da
+ .quad 0x3e598cfcdb6a2dbc
+ .quad 0x3e24464a6bcdf47b
+ .quad 0x3e41f1774d8b66a6
+ .quad 0x3e459920a2adf6fa
+ .quad 0x3e370d02a99b4c5a
+ .quad 0x3e576b6cafa2532d
+ .quad 0x3e5d23c38ec17936
+ .quad 0x3e541b6b1b0e66c4
+ .quad 0x3e5952662c6bfdc7
+ .quad 0x3e4333f3d6bb35ec
+ .quad 0x3e195120d8486e92
+ .quad 0x3e5db8a405fac56e
+ .quad 0x3e5a4c112ce6312e
+ .quad 0x3e536987e1924e45
+ .quad 0x3e33f98ea94bc1bd
+ .quad 0x3e459718aacb6ec7
+ .quad 0x3df975077f1f5f0c
+ .quad 0x3e13654d88f20500
+ .quad 0x3e40f598530f101b
+ .quad 0x3e5145f6c94f7fd7
+ .quad 0x3e567fead8bcce75
+ .quad 0x3e52e67148cd0a7b
+ .quad 0x3e10d5e5897de907
+ .quad 0x3e5b5ee92c53d919
+ .quad 0x3e5c1c02803f7554
+ .quad 0x3e5d5caa7a35c9f7
+ .quad 0x3e5910459cac3223
+ .quad 0x3e41fbe1bb98afdf
+ .quad 0x3e3b135395510d1e
+ .quad 0x3e47b8f0e7b8e757
+ .quad 0x3e519511f61a96b8
+ .quad 0x3e5117d846ae1f8e
+ .quad 0x3e2b3a9507d6dc1f
+ .quad 0x3e15fa7c78c9e676
+ .quad 0x3e2db76303b21928
+ .quad 0x3e27eb8450ac22ed
+ .quad 0x3e579e0caa9c9ab7
+ .quad 0x3e59de6d7cba1bbe
+ .quad 0x3e1df5f5baf436cb
+ .quad 0x3e3e746344728dbf
+ .quad 0x3e277c23362928b9
+ .quad 0x3e4715137cfeba9f
+ .quad 0x3e58fe55f2856443
+ .quad 0x3e25bd1a025d9e24
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+
diff --git a/src/gas/log2f.S b/src/gas/log2f.S
new file mode 100644
index 0000000..5361e0f
--- /dev/null
+++ b/src/gas/log2f.S
@@ -0,0 +1,738 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log2f.S
+#
+# An implementation of the log2f libm function.
+#
+# Prototype:
+#
+# float log2f(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
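In this single-precision variant the table entries are already base-2 logarithms; only the small correction term is computed as a natural-log series and then rescaled by log2(e) (the .L__real_log2_e constant defined further down, split into lead/tail pieces on the near-one path). Reduced to its essence, that rescaling is the identity log2(x) = ln(x) * log2(e); a one-line C sketch, with log2f_ref as an illustrative name:

    #include <math.h>

    /* Sketch of the rescaling only.  The assembly keeps log2(e) as
     * lead + tail pieces near x == 1 so the multiplication does not
     * drop the low-order bits. */
    float log2f_ref(float x)
    {
        return logf(x) * 1.44269504f;     /* log2(e), cf. .L__real_log2_e */
    }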
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log2f)
+#define fname_special _log2f_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ movss (%r9,%rax,4), %xmm3
+ addss %xmm2, %xmm1
+
+ mulss .L__real_log2_e(%rip), %xmm1
+
+ # m + log2(G) - poly*log2_e
+ subss %xmm1, %xmm3
+ movss %xmm3, %xmm0
+ addss (%r10,%rax,4), %xmm5
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ movdqa %xmm0, %xmm5
+ pand .L__mask_lower(%rip), %xmm5
+ subss %xmm5, %xmm0
+ addss %xmm0, %xmm2
+
+ movss %xmm5, %xmm0
+ movss %xmm2, %xmm1
+
+ mulss .L__real_log2_e_tail(%rip), %xmm2
+ mulss .L__real_log2_e_tail(%rip), %xmm0
+ mulss .L__real_log2_e_lead(%rip), %xmm1
+ mulss .L__real_log2_e_lead(%rip), %xmm5
+
+ addss %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log2_e_lead: .quad 0x03FB800003FB80000 # 1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2_e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__real_log2_e: .quad 0x3fb8aa3b3fb8aa3b
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000
+ .quad 0x0ffff0000ffff0000
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3c37c000
+ .long 0x3cb70000
+ .long 0x3d08c000
+ .long 0x3d35c000
+ .long 0x3d624000
+ .long 0x3d874000
+ .long 0x3d9d4000
+ .long 0x3db30000
+ .long 0x3dc8c000
+ .long 0x3dde4000
+ .long 0x3df38000
+ .long 0x3e044000
+ .long 0x3e0ec000
+ .long 0x3e194000
+ .long 0x3e238000
+ .long 0x3e2e0000
+ .long 0x3e380000
+ .long 0x3e424000
+ .long 0x3e4c4000
+ .long 0x3e564000
+ .long 0x3e604000
+ .long 0x3e6a4000
+ .long 0x3e740000
+ .long 0x3e7dc000
+ .long 0x3e83c000
+ .long 0x3e888000
+ .long 0x3e8d4000
+ .long 0x3e920000
+ .long 0x3e96c000
+ .long 0x3e9b8000
+ .long 0x3ea00000
+ .long 0x3ea4c000
+ .long 0x3ea94000
+ .long 0x3eae0000
+ .long 0x3eb28000
+ .long 0x3eb70000
+ .long 0x3ebb8000
+ .long 0x3ec00000
+ .long 0x3ec44000
+ .long 0x3ec8c000
+ .long 0x3ecd4000
+ .long 0x3ed18000
+ .long 0x3ed5c000
+ .long 0x3eda0000
+ .long 0x3ede8000
+ .long 0x3ee2c000
+ .long 0x3ee70000
+ .long 0x3eeb0000
+ .long 0x3eef4000
+ .long 0x3ef38000
+ .long 0x3ef78000
+ .long 0x3efbc000
+ .long 0x3effc000
+ .long 0x3f01c000
+ .long 0x3f040000
+ .long 0x3f060000
+ .long 0x3f080000
+ .long 0x3f0a0000
+ .long 0x3f0c0000
+ .long 0x3f0dc000
+ .long 0x3f0fc000
+ .long 0x3f11c000
+ .long 0x3f13c000
+ .long 0x3f15c000
+ .long 0x3f178000
+ .long 0x3f198000
+ .long 0x3f1b4000
+ .long 0x3f1d4000
+ .long 0x3f1f0000
+ .long 0x3f210000
+ .long 0x3f22c000
+ .long 0x3f24c000
+ .long 0x3f268000
+ .long 0x3f288000
+ .long 0x3f2a4000
+ .long 0x3f2c0000
+ .long 0x3f2dc000
+ .long 0x3f2f8000
+ .long 0x3f318000
+ .long 0x3f334000
+ .long 0x3f350000
+ .long 0x3f36c000
+ .long 0x3f388000
+ .long 0x3f3a4000
+ .long 0x3f3c0000
+ .long 0x3f3dc000
+ .long 0x3f3f8000
+ .long 0x3f414000
+ .long 0x3f42c000
+ .long 0x3f448000
+ .long 0x3f464000
+ .long 0x3f480000
+ .long 0x3f498000
+ .long 0x3f4b4000
+ .long 0x3f4d0000
+ .long 0x3f4e8000
+ .long 0x3f504000
+ .long 0x3f51c000
+ .long 0x3f538000
+ .long 0x3f550000
+ .long 0x3f56c000
+ .long 0x3f584000
+ .long 0x3f5a0000
+ .long 0x3f5b8000
+ .long 0x3f5d0000
+ .long 0x3f5ec000
+ .long 0x3f604000
+ .long 0x3f61c000
+ .long 0x3f638000
+ .long 0x3f650000
+ .long 0x3f668000
+ .long 0x3f680000
+ .long 0x3f698000
+ .long 0x3f6b0000
+ .long 0x3f6cc000
+ .long 0x3f6e4000
+ .long 0x3f6fc000
+ .long 0x3f714000
+ .long 0x3f72c000
+ .long 0x3f744000
+ .long 0x3f75c000
+ .long 0x3f770000
+ .long 0x3f788000
+ .long 0x3f7a0000
+ .long 0x3f7b8000
+ .long 0x3f7d0000
+ .long 0x3f7e8000
+ .long 0x3f800000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x374a16dd
+ .long 0x37f2d0b8
+ .long 0x381a3aa2
+ .long 0x37b4dd63
+ .long 0x383f5721
+ .long 0x384e27e8
+ .long 0x380bf749
+ .long 0x387dbeb2
+ .long 0x37216e46
+ .long 0x3684815b
+ .long 0x383b045f
+ .long 0x390b119b
+ .long 0x391a32ea
+ .long 0x38ba789e
+ .long 0x39553f30
+ .long 0x3651cfde
+ .long 0x39685a9d
+ .long 0x39057a05
+ .long 0x395ba0ef
+ .long 0x396bc5b6
+ .long 0x3936d9bb
+ .long 0x38772619
+ .long 0x39017ce9
+ .long 0x3902d720
+ .long 0x38856dd8
+ .long 0x3941f6b4
+ .long 0x3980b652
+ .long 0x3980f561
+ .long 0x39443f13
+ .long 0x38926752
+ .long 0x39c8c763
+ .long 0x391e12f3
+ .long 0x39b7bf89
+ .long 0x36d1cfde
+ .long 0x38c7f233
+ .long 0x39087367
+ .long 0x38e95d3f
+ .long 0x38256316
+ .long 0x39d38e5c
+ .long 0x396ea247
+ .long 0x350e4788
+ .long 0x395d829f
+ .long 0x39c30f2f
+ .long 0x39fd7ee7
+ .long 0x3872e9e7
+ .long 0x3897d694
+ .long 0x3824923a
+ .long 0x39ea7c06
+ .long 0x39a7fa88
+ .long 0x391aa879
+ .long 0x39dace65
+ .long 0x39215a32
+ .long 0x39af3350
+ .long 0x3a7b5172
+ .long 0x389cf27f
+ .long 0x3902806b
+ .long 0x3909d8a9
+ .long 0x38c9faa1
+ .long 0x37a33dca
+ .long 0x3a6623d2
+ .long 0x3a3c7a61
+ .long 0x3a083a84
+ .long 0x39930161
+ .long 0x35d1cfde
+ .long 0x3a2d0ebd
+ .long 0x399f1aad
+ .long 0x3a67ff6d
+ .long 0x39ecfea8
+ .long 0x3a7b26f3
+ .long 0x39ec1fa6
+ .long 0x3a675314
+ .long 0x399e12f3
+ .long 0x3a2d4b66
+ .long 0x370c3845
+ .long 0x399ba329
+ .long 0x3a1044d3
+ .long 0x3a49a196
+ .long 0x3a79fe83
+ .long 0x3905c7aa
+ .long 0x39802391
+ .long 0x39abe796
+ .long 0x39c65a9d
+ .long 0x39cfa6c5
+ .long 0x39c7f593
+ .long 0x39af6ff7
+ .long 0x39863e4d
+ .long 0x391910c1
+ .long 0x369d5be7
+ .long 0x3a541616
+ .long 0x3a1ee960
+ .long 0x39c38ed2
+ .long 0x38e61600
+ .long 0x3a4fedb4
+ .long 0x39f6b4ab
+ .long 0x38f8d3b0
+ .long 0x3a3b3faa
+ .long 0x399fb693
+ .long 0x3a5cfe71
+ .long 0x39c5740b
+ .long 0x3a611eb0
+ .long 0x39b079c4
+ .long 0x3a4824d7
+ .long 0x39439a54
+ .long 0x3a1291ea
+ .long 0x3a6d3673
+ .long 0x3981c731
+ .long 0x3a0da88f
+ .long 0x3a53945c
+ .long 0x3895ae91
+ .long 0x3996372a
+ .long 0x39f9a832
+ .long 0x3a27eda4
+ .long 0x3a4c764f
+ .long 0x3a6a7c06
+ .long 0x370321eb
+ .long 0x3899ab3f
+ .long 0x38f02086
+ .long 0x390a1707
+ .long 0x39031e44
+ .long 0x38c6b362
+ .long 0x382bf195
+ .long 0x3a768e36
+ .long 0x3a5c503b
+ .long 0x3a3c1179
+ .long 0x3a15de1d
+ .long 0x39d3845d
+ .long 0x395f263f
+ .long 0x00000000
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/logf.S b/src/gas/logf.S
new file mode 100644
index 0000000..4cee0b0
--- /dev/null
+++ b/src/gas/logf.S
@@ -0,0 +1,725 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# logf.S
+#
+# An implementation of the logf libm function.
+#
+# Prototype:
+#
+# float logf(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
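One piece worth spelling out is the near-one path, taken when |x - 1| is below the 0.0625 threshold: it avoids cancellation by expanding ln(1+r) in u = 2r/(2+r), using the ca1/ca2 coefficients defined in the data section (approximately 1/12 and 1/80). A C sketch that mirrors the register-level steps of the .L__near_one block, with logf_near_one as an illustrative name:

    /* ln(1+r) = u + u^3/12 + u^5/80 + ...  with u = 2r/(2+r);
     * "correction" folds r into u without cancellation, exactly as the
     * .L__near_one block below does in xmm registers. */
    static float logf_near_one(float r)          /* r = x - 1.0f */
    {
        float u_half     = r / (2.0f + r);
        float correction = r * u_half;
        float u          = u_half + u_half;
        float v          = u * u;
        float r2         = u * v * (0.08333333f + 0.0125f * v) - correction;
        return r + r2;                           /* ~= logf(1 + r) */
    }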
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(logf)
+#define fname_special _logf_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute the index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # check e as a special case
+ comiss .L__real_ef(%rip), %xmm0
+ je .L__logf_e
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+ movss .L__real_log2_tail(%rip), %xmm3
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ mulss %xmm5, %xmm3
+ addss %xmm2, %xmm1
+
+ # m*log(2) + log(G) - poly
+ movss .L__real_log2_lead(%rip), %xmm0
+ subss %xmm1, %xmm3 # z2
+ mulss %xmm5, %xmm0
+ addss (%r9,%rax,4), %xmm3 # z2
+ addss (%r10,%rax,4), %xmm0 # z1
+
+ addss %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__logf_e:
+ movss .L__real_one(%rip), %xmm0
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ # r + r2
+ addss %xmm2, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3bff0000
+ .long 0x3c7e0000
+ .long 0x3cbdc000
+ .long 0x3cfc1000
+ .long 0x3d1cf000
+ .long 0x3d3ba000
+ .long 0x3d5a1000
+ .long 0x3d785000
+ .long 0x3d8b2000
+ .long 0x3d9a0000
+ .long 0x3da8d000
+ .long 0x3db78000
+ .long 0x3dc61000
+ .long 0x3dd49000
+ .long 0x3de2f000
+ .long 0x3df13000
+ .long 0x3dff6000
+ .long 0x3e06b000
+ .long 0x3e0db000
+ .long 0x3e14a000
+ .long 0x3e1b8000
+ .long 0x3e226000
+ .long 0x3e293000
+ .long 0x3e2ff000
+ .long 0x3e36b000
+ .long 0x3e3d5000
+ .long 0x3e43f000
+ .long 0x3e4a9000
+ .long 0x3e511000
+ .long 0x3e579000
+ .long 0x3e5e1000
+ .long 0x3e647000
+ .long 0x3e6ae000
+ .long 0x3e713000
+ .long 0x3e778000
+ .long 0x3e7dc000
+ .long 0x3e820000
+ .long 0x3e851000
+ .long 0x3e882000
+ .long 0x3e8b3000
+ .long 0x3e8e4000
+ .long 0x3e914000
+ .long 0x3e944000
+ .long 0x3e974000
+ .long 0x3e9a3000
+ .long 0x3e9d3000
+ .long 0x3ea02000
+ .long 0x3ea30000
+ .long 0x3ea5f000
+ .long 0x3ea8d000
+ .long 0x3eabb000
+ .long 0x3eae8000
+ .long 0x3eb16000
+ .long 0x3eb43000
+ .long 0x3eb70000
+ .long 0x3eb9c000
+ .long 0x3ebc9000
+ .long 0x3ebf5000
+ .long 0x3ec21000
+ .long 0x3ec4d000
+ .long 0x3ec78000
+ .long 0x3eca3000
+ .long 0x3ecce000
+ .long 0x3ecf9000
+ .long 0x3ed24000
+ .long 0x3ed4e000
+ .long 0x3ed78000
+ .long 0x3eda2000
+ .long 0x3edcc000
+ .long 0x3edf5000
+ .long 0x3ee1e000
+ .long 0x3ee47000
+ .long 0x3ee70000
+ .long 0x3ee99000
+ .long 0x3eec1000
+ .long 0x3eeea000
+ .long 0x3ef12000
+ .long 0x3ef3a000
+ .long 0x3ef61000
+ .long 0x3ef89000
+ .long 0x3efb0000
+ .long 0x3efd7000
+ .long 0x3effe000
+ .long 0x3f012000
+ .long 0x3f025000
+ .long 0x3f039000
+ .long 0x3f04c000
+ .long 0x3f05f000
+ .long 0x3f072000
+ .long 0x3f084000
+ .long 0x3f097000
+ .long 0x3f0aa000
+ .long 0x3f0bc000
+ .long 0x3f0cf000
+ .long 0x3f0e1000
+ .long 0x3f0f4000
+ .long 0x3f106000
+ .long 0x3f118000
+ .long 0x3f12a000
+ .long 0x3f13c000
+ .long 0x3f14e000
+ .long 0x3f160000
+ .long 0x3f172000
+ .long 0x3f183000
+ .long 0x3f195000
+ .long 0x3f1a7000
+ .long 0x3f1b8000
+ .long 0x3f1c9000
+ .long 0x3f1db000
+ .long 0x3f1ec000
+ .long 0x3f1fd000
+ .long 0x3f20e000
+ .long 0x3f21f000
+ .long 0x3f230000
+ .long 0x3f241000
+ .long 0x3f252000
+ .long 0x3f263000
+ .long 0x3f273000
+ .long 0x3f284000
+ .long 0x3f295000
+ .long 0x3f2a5000
+ .long 0x3f2b5000
+ .long 0x3f2c6000
+ .long 0x3f2d6000
+ .long 0x3f2e6000
+ .long 0x3f2f7000
+ .long 0x3f307000
+ .long 0x3f317000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x3429ac41
+ .long 0x35a8b0fc
+ .long 0x368d83ea
+ .long 0x361b0e78
+ .long 0x3687b9fe
+ .long 0x3631ec65
+ .long 0x36dd7119
+ .long 0x35c30045
+ .long 0x379b7751
+ .long 0x37ebcb0d
+ .long 0x37839f83
+ .long 0x37528ae5
+ .long 0x37a2eb18
+ .long 0x36da7495
+ .long 0x36a91eb7
+ .long 0x3783b715
+ .long 0x371131db
+ .long 0x383f3e68
+ .long 0x38156a97
+ .long 0x38297c0f
+ .long 0x387e100f
+ .long 0x3815b665
+ .long 0x37e5e3a1
+ .long 0x38183853
+ .long 0x35fe719d
+ .long 0x38448108
+ .long 0x38503290
+ .long 0x373539e8
+ .long 0x385e0ff1
+ .long 0x3864a740
+ .long 0x3786742d
+ .long 0x387be3cd
+ .long 0x3685ad3e
+ .long 0x3803b715
+ .long 0x37adcbdc
+ .long 0x380c36af
+ .long 0x371652d3
+ .long 0x38927139
+ .long 0x38c5fcd7
+ .long 0x38ae55d5
+ .long 0x3818c169
+ .long 0x38a0fde7
+ .long 0x38ad09ef
+ .long 0x3862bae1
+ .long 0x38eecd4c
+ .long 0x3798aad2
+ .long 0x37421a1a
+ .long 0x38c5e10e
+ .long 0x37bf2aee
+ .long 0x382d872d
+ .long 0x37ee2e8a
+ .long 0x38dedfac
+ .long 0x3802f2b9
+ .long 0x38481e9b
+ .long 0x380eaa2b
+ .long 0x38ebfb5d
+ .long 0x38255fdd
+ .long 0x38783b82
+ .long 0x3851da1e
+ .long 0x374e1b05
+ .long 0x388f439b
+ .long 0x38ca0e10
+ .long 0x38cac08b
+ .long 0x3891f65f
+ .long 0x378121cb
+ .long 0x386c9a9a
+ .long 0x38949923
+ .long 0x38777bcc
+ .long 0x37b12d26
+ .long 0x38a6ced3
+ .long 0x38ebd3e6
+ .long 0x38fbe3cd
+ .long 0x38d785c2
+ .long 0x387e7e00
+ .long 0x38f392c5
+ .long 0x37d40983
+ .long 0x38081a7c
+ .long 0x3784c3ad
+ .long 0x38cce923
+ .long 0x380f5faf
+ .long 0x3891fd38
+ .long 0x38ac47bc
+ .long 0x3897042b
+ .long 0x392952d2
+ .long 0x396fced4
+ .long 0x37f97073
+ .long 0x385e9eae
+ .long 0x3865c84a
+ .long 0x38130ba3
+ .long 0x3979cf16
+ .long 0x3938cac9
+ .long 0x38c3d2f4
+ .long 0x39755dec
+ .long 0x38e6b467
+ .long 0x395c0fb8
+ .long 0x383ebce0
+ .long 0x38dcd192
+ .long 0x39186bdf
+ .long 0x392de74c
+ .long 0x392f0944
+ .long 0x391bff61
+ .long 0x38e9ed44
+ .long 0x38686dc8
+ .long 0x396b99a7
+ .long 0x39099c89
+ .long 0x37a27673
+ .long 0x390bdaa3
+ .long 0x397069ab
+ .long 0x388449ff
+ .long 0x39013538
+ .long 0x392dc268
+ .long 0x3947f423
+ .long 0x394ff17c
+ .long 0x3945e10e
+ .long 0x3929e8f5
+ .long 0x38f85db0
+ .long 0x38735f99
+ .long 0x396c08db
+ .long 0x3909e600
+ .long 0x37b4996f
+ .long 0x391233cc
+ .long 0x397cead9
+ .long 0x38adb5cd
+ .long 0x3920261a
+ .long 0x3958ee36
+ .long 0x35aa4905
+ .long 0x37cbd11e
+ .long 0x3805fdf4
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/nearbyint.S b/src/gas/nearbyint.S
new file mode 100644
index 0000000..edb1549
--- /dev/null
+++ b/src/gas/nearbyint.S
@@ -0,0 +1,98 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# nearbyint.S
+#
+# An implementation of the nearbyint libm function.
+#
+# Prototype:
+#
+# double nearbyint(double x);
+#
+
+#
+# Algorithm:
+# Round x to an integral value by adding and then subtracting a
+# sign-matched 2^52, letting the FPU rounding do the work; inputs
+# with |x| >= 2^52 are already integral and are returned unchanged.
+#
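The same trick is easy to express in C. The sketch below mirrors the sign handling spelled out in the comments further down, and assumes strict IEEE evaluation (no -ffast-math) and the default round-to-nearest mode; nearbyint_sketch is an illustrative name, not the entry point this file defines:

    #include <math.h>

    /* Sketch only: adding and subtracting a sign-matched 2^52 makes the
     * FPU round x to an integer; larger magnitudes (and NaNs) pass through. */
    static double nearbyint_sketch(double x)
    {
        if (!(fabs(x) < 0x1p52))                 /* |x| >= 2^52, inf, or NaN */
            return x;
        double big = copysign(0x1p52, x);
        return copysign((x + big) - big, x);     /* keep the sign of -0.0 */
    }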
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(nearbyint)
+#define fname_special _nearbyint_special
+
+
+# local variable storage offsets
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ movsd .L__2p52_mask_64(%rip),%xmm2
+ movsd .L__sign_mask_64(%rip),%xmm4
+ movsd %xmm4,%xmm6
+ movsd %xmm0,%xmm1 # copy the input into registers xmm1 and xmm5
+ movsd %xmm0,%xmm5
+ pand %xmm4,%xmm1 # xmm1 = abs(xmm1)
+ movsd %xmm1,%xmm3 # move xmm1 to xmm3
+ comisd %xmm2,%xmm1 #
+ jnc .L__greater_than_2p52 #
+ jp .L__is_infinity_nan # the parity flag is set if either xmm2 or
+ # xmm1 is NaN
+.L__normal_input_case:
+ #sign.u32 = checkbits.u32[1] & 0x80000000;
+ #xmm4 = sign.u32
+ pandn %xmm5,%xmm4
+ #val_2p52.u32[1] = sign.u32 | 0x43300000;
+ #val_2p52.u32[0] = 0;
+ por %xmm4,%xmm2
+ #val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64;
+ addpd %xmm2,%xmm5
+ subpd %xmm5,%xmm2
+ #val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32;
+ pand %xmm6,%xmm2
+ por %xmm4,%xmm2
+ movsd %xmm2,%xmm0 # move the result to xmm0 register
+ ret
+.L__special_case:
+.L__greater_than_2p52:
+ ret # result is present in xmm0
+.L__is_infinity_nan:
+ addpd %xmm0,%xmm0
+ ret
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+
+
+
+
+
+
diff --git a/src/gas/pow.S b/src/gas/pow.S
new file mode 100644
index 0000000..8028b83
--- /dev/null
+++ b/src/gas/pow.S
@@ -0,0 +1,2244 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# pow.S
+#
+# An implementation of the pow libm function.
+#
+# Prototype:
+#
+# double pow(double x, double y);
+#
+
+#
+# Algorithm:
+# x^y = e^(y*ln(x))
+#
+# Look in exp, log for the respective algorithms
+#
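+# Evaluating exp(y*log(x)) naively in double precision magnifies the
+# rounding error of log(x) by roughly |y*ln(x)|, so ln(x) and the product
+# y*ln(x) are carried below as head/tail (extra-precision) pairs before the
+# final exp stage.  Very roughly, in C (a sketch only; the special-case
+# handling and the head/tail bookkeeping that follow are omitted):
+#
+#     double pow_sketch(double x, double y)
+#     {
+#         double v = y * log(x);   /* done as a head/tail product below  */
+#         return exp(v);           /* table-driven exp, see the exp data */
+#     }
+#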
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(pow)
+#define fname_special _pow_special@PLT
+
+
+# local variable storage offsets
+.equ save_x, 0x0
+.equ save_y, 0x10
+.equ p_temp_exp, 0x20
+.equ negate_result, 0x30
+.equ save_ax, 0x40
+.equ y_head, 0x50
+.equ p_temp_log, 0x60
+.equ stack_size, 0x78
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ movsd %xmm0, save_x(%rsp)
+ movsd %xmm1, save_y(%rsp)
+
+ mov save_x(%rsp), %rdx
+ mov save_y(%rsp), %r8
+
+ mov .L__exp_mant_mask(%rip), %r10
+ and %r8, %r10
+ jz .L__y_is_zero
+
+ cmp .L__pos_one(%rip), %r8
+ je .L__y_is_one
+
+ mov .L__sign_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__sign_mask(%rip), %r9
+ mov .L__pos_zero(%rip), %rax
+ mov %rax, negate_result(%rsp)
+ je .L__x_is_neg
+
+ cmp .L__pos_one(%rip), %rdx
+ je .L__x_is_pos_one
+
+ cmp .L__pos_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_max_bound(%rip), %r10
+ jg .L__ay_is_very_large
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_min_bound(%rip), %r10
+ jl .L__ay_is_very_small
+
+ # -----------------------------
+ # compute log(x) here
+ # -----------------------------
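+ #
+ # Table-driven reduction: write x = 2^xexp * m with m in [1,2) and pick
+ # F = 1 + j/256 from the leading mantissa bits.  Then, roughly,
+ #
+ #     ln(x) = xexp*ln(2) + ln(F) + ln(m/F)
+ #
+ # with ln(F) from the 256-entry lead/tail tables, 1/F from the reciprocal
+ # tables, and ln(m/F) from a short polynomial in the small quantity
+ # (m - F)/F.  All pieces are kept as head/tail pairs so the result has
+ # extra precision for the later multiply by y.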
+.L__log_x:
+
+ # compute exponent part
+ xor %r8, %r8
+ movdqa %xmm0, %xmm3
+ psrlq $52, %xmm3
+ movd %xmm0, %r8
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+ pand .L__real_mant(%rip), %xmm2
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ movsd %xmm0, %xmm7
+ mov %r8, %r9
+ and .L__mask_mant_all8(%rip), %r8
+ and .L__mask_mant9(%rip), %r9
+ subsd .L__real_one(%rip), %xmm7
+ shl %r9
+ add %r9, %r8
+ mov %r8, p_temp_log(%rsp)
+ andpd .L__real_notsign(%rip), %xmm7
+
+ # form F and Y; branch to the near-one codepath when |x - 1| < 0.125
+ movsd p_temp_log(%rsp), %xmm1
+ shr $44, %r8
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ comisd .L__real_threshold(%rip), %xmm7
+ lea .L__log_F_inv_head(%rip), %r9
+ lea .L__log_F_inv_tail(%rip), %rdx
+ jb .L__near_one
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ movsd %xmm1, %xmm4
+ mulsd (%r9,%r8,8), %xmm1
+ movsd %xmm1, %xmm5
+ mulsd (%rdx,%r8,8), %xmm4
+ movsd %xmm4, %xmm7
+ addsd %xmm4, %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ subsd %xmm2, %xmm5
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm5, %xmm7
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+ addsd %xmm7, %xmm1
+
+ movsd .L__real_log2_tail(%rip), %xmm5
+ lea .L__log_256_tail(%rip), %rdx
+ mulsd %xmm6, %xmm5
+ movsd (%r9,%r8,8), %xmm0
+ subsd %xmm1, %xmm5
+
+ movsd (%rdx,%r8,8), %xmm3
+ addsd %xmm5, %xmm3
+ movsd %xmm3, %xmm1
+ subsd %xmm2, %xmm3
+
+ movsd .L__real_log2_lead(%rip), %xmm7
+ mulsd %xmm6, %xmm7
+ addsd %xmm7, %xmm0
+
+ # result of ln(x) is computed from head and tail parts, resH and resT
+ # res = ln(x) = resH + resT
+ # resH and resT are in full precision
+
+ # resT is computed from head and tail parts, resT_h and resT_t
+ # resT = resT_h + resT_t
+
+ # now
+ # xmm3 - resT
+ # xmm0 - resH
+ # xmm1 - (resT_t)
+ # xmm2 - (-resT_h)
+
+.L__log_x_continue:
+
+ movsd %xmm0, %xmm7
+ addsd %xmm3, %xmm0
+ movsd %xmm0, %xmm5
+ andpd .L__real_fffffffff8000000(%rip), %xmm0
+
+ # xmm0 - H
+ # xmm7 - resH
+ # xmm5 - res
+
+ mov save_y(%rsp), %rax
+ and .L__real_fffffffff8000000(%rip), %rax
+
+ addsd %xmm3, %xmm2
+ subsd %xmm5, %xmm7
+ subsd %xmm2, %xmm1
+ addsd %xmm3, %xmm7
+ subsd %xmm0, %xmm5
+
+ mov %rax, y_head(%rsp)
+ movsd save_y(%rsp), %xmm4
+
+ addsd %xmm1, %xmm7
+ addsd %xmm5, %xmm7
+
+ # res = H + T
+ # H has leading 26 bits of precision
+ # T has full precision
+
+ # xmm0 - H
+ # xmm7 - T
+
+ movsd y_head(%rsp), %xmm2
+ subsd %xmm2, %xmm4
+
+ # y is split into head and tail
+ # for y * ln(x) computation
+
+ # xmm4 - Yt
+ # xmm2 - Yh
+ # xmm0 - H
+ # xmm7 - T
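+ #
+ # (splitting each factor into a 26-bit head and a tail makes the
+ # head*head product exact in double precision, so y*ln(x) below is
+ # obtained to roughly double-double accuracy)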
+
+ movsd %xmm4, %xmm3
+ movsd %xmm7, %xmm5
+ movsd %xmm0, %xmm6
+ mulsd %xmm7, %xmm3 # YtRt
+ mulsd %xmm0, %xmm4 # YtRh
+ mulsd %xmm2, %xmm5 # YhRt
+ mulsd %xmm2, %xmm6 # YhRh
+
+ movsd %xmm6, %xmm1
+ addsd %xmm4, %xmm3
+ addsd %xmm5, %xmm3
+
+ addsd %xmm3, %xmm1
+ movsd %xmm1, %xmm0
+
+ subsd %xmm1, %xmm6
+ addsd %xmm3, %xmm6
+
+ # y * ln(x) = v + vt
+ # v and vt are in full precision
+
+ # xmm0 - v
+ # xmm6 - vt
+
+ # -----------------------------
+ # compute exp( y * ln(x) ) here
+ # -----------------------------
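+ #
+ # With n = int(v * 64/ln(2)) and n = 64*m + j (0 <= j < 64),
+ #
+ #     e^v = 2^m * 2^(j/64) * e^r,   r = v - n*ln(2)/64
+ #
+ # 2^(j/64) comes from the head/tail tables below, e^r from a short
+ # polynomial in the small remainder r, and the 2^m scaling is applied by
+ # constructing the exponent directly (with extra care near underflow).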
+
+ # v * (64/ln(2))
+ movsd .L__real_64_by_log2(%rip), %xmm7
+ movsd %xmm0, p_temp_exp(%rsp)
+ mulsd %xmm0, %xmm7
+ mov p_temp_exp(%rsp), %rdx
+
+ # v < 1024*ln(2), ( v * (64/ln(2)) ) < 64*1024
+ # v >= -1075*ln(2), ( v * (64/ln(2)) ) >= 64*(-1075)
+ comisd .L__real_p65536(%rip), %xmm7
+ ja .L__process_result_inf
+
+ comisd .L__real_m68800(%rip), %xmm7
+ jb .L__process_result_zero
+
+ # n = int( v * (64/ln(2)) )
+ cvtpd2dq %xmm7, %xmm4
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ cvtdq2pd %xmm4, %xmm1
+
+ # r1 = x - n * ln(2)/64 head
+ movsd .L__real_log2_by_64_head(%rip), %xmm2
+ mulsd %xmm1, %xmm2
+ movd %xmm4, %ecx
+ mov $0x3f, %rax
+ and %ecx, %eax
+ subsd %xmm2, %xmm0
+
+ # r2 = - n * ln(2)/64 tail
+ mulsd .L__real_log2_by_64_tail(%rip), %xmm1
+ movsd %xmm0, %xmm2
+
+ # m = (n - j) / 64
+ sub %eax, %ecx
+ sar $6, %ecx
+
+ # r1+r2
+ addsd %xmm1, %xmm2
+ addsd %xmm6, %xmm2 # add vt here
+ movsd %xmm2, %xmm1
+
+ # q
+ movsd .L__real_1_by_2(%rip), %xmm0
+ movsd .L__real_1_by_24(%rip), %xmm3
+ movsd .L__real_1_by_720(%rip), %xmm4
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm4
+
+ movsd %xmm1, %xmm5
+ mulsd %xmm2, %xmm1
+ addsd .L__real_one(%rip), %xmm0
+ addsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm1, %xmm5
+ addsd .L__real_1_by_120(%rip), %xmm4
+ mulsd %xmm2, %xmm0
+ mulsd %xmm1, %xmm3
+
+ mulsd %xmm5, %xmm4
+
+ # deal with denormal results
+ xor %r9d, %r9d
+ cmp .L__denormal_threshold(%rip), %ecx
+
+ addsd %xmm4, %xmm3
+ addsd %xmm3, %xmm0
+
+ cmovle %ecx, %r9d
+ add $1023, %rcx
+ shl $52, %rcx
+
+ # f1, f2
+ movsd (%r11,%rax,8), %xmm5
+ movsd (%r10,%rax,8), %xmm1
+ mulsd %xmm0, %xmm5
+ mulsd %xmm0, %xmm1
+
+ cmp .L__real_inf(%rip), %rcx
+
+ # (f1+f2)*(1+q)
+ addsd (%r11,%rax,8), %xmm5
+ addsd %xmm5, %xmm1
+ addsd (%r10,%rax,8), %xmm1
+ movsd %xmm1, %xmm0
+
+ je .L__process_almost_inf
+
+ test %r9d, %r9d
+ mov %rcx, p_temp_exp(%rsp)
+ jnz .L__process_denormal
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+
+.L__final_check:
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__process_almost_inf:
+ comisd .L__real_one(%rip), %xmm0
+ jae .L__process_result_inf
+
+ orpd .L__enable_almost_inf(%rip), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__process_denormal:
+ mov %r9d, %ecx
+ xor %r11d, %r11d
+ comisd .L__real_one(%rip), %xmm0
+ cmovae %ecx, %r11d
+ cmp .L__denormal_threshold(%rip), %r11d
+ jne .L__process_true_denormal
+
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__process_true_denormal:
+ xor %r8, %r8
+ cmp .L__denormal_tiny_threshold(%rip), %rdx
+ mov $1, %r9
+ jg .L__process_denormal_tiny
+ add $1074, %ecx
+ cmovs %r8, %rcx
+ shl %cl, %r9
+ mov %r9, %rcx
+
+ mov %rcx, p_temp_exp(%rsp)
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__z_denormal
+
+.p2align 4,,15
+.L__process_denormal_tiny:
+ movsd .L__real_smallest_denormal(%rip), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__z_denormal
+
+.p2align 4,,15
+.L__process_result_zero:
+ mov .L__real_zero(%rip), %r11
+ or negate_result(%rsp), %r11
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__process_result_inf:
+ mov .L__real_inf(%rip), %r11
+ or negate_result(%rsp), %r11
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %r8
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_max_bound(%rip), %r10
+ jg .L__ay_is_very_large
+
+ # determine if y is an integer
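+ # (an unbiased exponent of 53 or more forces y to be an even integer;
+ # otherwise y is an integer iff the mantissa bits below its binary point
+ # are all zero, and the lowest integer bit tells odd from even, i.e.
+ # whether the result for a negative x must be negated)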
+ mov .L__exp_mant_mask(%rip), %r10
+ and %r8, %r10
+ mov %r10, %r11
+ mov .L__exp_shift(%rip), %rcx
+ shr %cl, %r10
+ sub .L__exp_bias(%rip), %r10
+ js .L__x_is_neg_y_is_not_int
+
+ mov .L__exp_mant_mask(%rip), %rax
+ and %rdx, %rax
+ mov %rax, save_ax(%rsp)
+
+ cmp .L__yexp_53(%rip), %r10
+ mov %r10, %rcx
+ jg .L__continue_after_y_int_check
+
+ mov .L__mant_full(%rip), %r9
+ shr %cl, %r9
+ and %r11, %r9
+ jnz .L__x_is_neg_y_is_not_int
+
+ mov .L__1_before_mant(%rip), %r9
+ shr %cl, %r9
+ and %r11, %r9
+ jz .L__continue_after_y_int_check
+
+ mov .L__sign_mask(%rip), %rax
+ mov %rax, negate_result(%rsp)
+
+.L__continue_after_y_int_check:
+
+ cmp .L__neg_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ cmp .L__neg_one(%rip), %rdx
+ je .L__x_is_neg_one
+
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ movsd save_ax(%rsp), %xmm0
+ jmp .L__log_x
+
+
+.p2align 4,,15
+.L__near_one:
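+ # |x - 1| < 0.125: ln(x) itself is small here, so the reduced argument and
+ # the leading r + r^2/2 terms are split into head and tail pieces to keep
+ # full relative accuracy.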
+
+ # f = F - Y, r = f * inv
+ movsd %xmm1, %xmm0
+ subsd %xmm2, %xmm1
+ movsd %xmm1, %xmm4
+
+ movsd (%r9,%r8,8), %xmm3
+ addsd (%rdx,%r8,8), %xmm3
+ mulsd %xmm3, %xmm4
+ andpd .L__real_fffffffff8000000(%rip), %xmm4
+ movsd %xmm4, %xmm5 # r1
+ mulsd %xmm0, %xmm4
+ subsd %xmm4, %xmm1
+ mulsd %xmm3, %xmm1
+ movsd %xmm1, %xmm7 # r2
+ addsd %xmm5, %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_7(%rip), %xmm3
+ movsd .L__real_1_over_4(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_6(%rip), %xmm3
+ addsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ addsd .L__real_1_over_5(%rip), %xmm3
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ mulsd %xmm4, %xmm3
+
+ movsd %xmm5, %xmm2
+ movsd %xmm7, %xmm0
+ mulsd %xmm0, %xmm0
+ mulsd .L__real_1_over_2(%rip), %xmm0
+ mulsd %xmm7, %xmm5
+ addsd %xmm0, %xmm5
+ addsd %xmm7, %xmm5
+
+ movsd %xmm2, %xmm0
+ movsd %xmm2, %xmm7
+ mulsd %xmm0, %xmm0
+ mulsd .L__real_1_over_2(%rip), %xmm0
+ movsd %xmm0, %xmm4
+ addsd %xmm0, %xmm2 # r1 + r1^2/2
+ subsd %xmm2, %xmm7
+ addsd %xmm4, %xmm7
+
+ addsd %xmm7, %xmm3
+ movsd .L__real_log2_tail(%rip), %xmm4
+ addsd %xmm3, %xmm1
+ mulsd %xmm6, %xmm4
+ lea .L__log_256_tail(%rip), %rdx
+ addsd %xmm5, %xmm1
+ addsd (%rdx,%r8,8), %xmm4
+ subsd %xmm1, %xmm4
+
+ movsd %xmm4, %xmm3
+ movsd %xmm4, %xmm1
+ subsd %xmm2, %xmm3
+
+ movsd (%r9,%r8,8), %xmm0
+ movsd .L__real_log2_lead(%rip), %xmm7
+ mulsd %xmm6, %xmm7
+ addsd %xmm7, %xmm0
+
+ jmp .L__log_x_continue
+
+
+.p2align 4,,15
+.L__x_is_pos_one:
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jz .L__final_check
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movsd .L__pos_one(%rip), %xmm2
+ mov .L__flag_x_one_y_snan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_zero:
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r9
+ mov .L__real_one(%rip), %r11
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ cmove %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ and %rax, %r9
+ jnz .L__x_is_nan
+
+ movsd .L__real_one(%rip), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_one:
+ xor %rax, %rax
+ mov %rdx, %r11
+ mov .L__exp_mask(%rip), %r9
+ or .L__qnan_set(%rip), %r11
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ cmove %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ and %rax, %r9
+ jnz .L__x_is_nan
+
+ movd %rdx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_one:
+ mov .L__pos_one(%rip), %rdx
+ or negate_result(%rsp), %rdx
+ xor %rax, %rax
+ mov %r8, %r11
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r11
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jnz .L__y_is_nan
+
+ movd %rdx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_y_is_not_int:
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ cmp .L__neg_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movsd .L__qnan(%rip), %xmm2
+ mov .L__flag_x_neg_y_notint(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_large:
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ mov .L__exp_mant_mask(%rip), %r9
+ and %rdx, %r9
+ jz .L__x_is_zero
+
+ cmp .L__neg_one(%rip), %rdx
+ je .L__x_is_neg_one
+
+ mov %rdx, %r9
+ and .L__exp_mant_mask(%rip), %r9
+ cmp .L__pos_one(%rip), %r9
+ jl .L__ax_lt1_y_is_large_or_inf_or_nan
+
+ jmp .L__ax_gt1_y_is_large_or_inf_or_nan
+
+.p2align 4,,15
+.L__x_is_zero:
+ mov .L__exp_mask(%rip), %r10
+ xor %rax, %rax
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ je .L__x_is_zero_y_is_inf_or_nan
+
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovnz .L__pos_inf(%rip), %rax
+ jnz .L__x_is_zero_z_is_inf
+
+ movd %rax, %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_z_is_inf:
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %rax, %xmm2
+ orpd negate_result(%rsp), %xmm2
+ mov .L__flag_x_zero_z_inf(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_y_is_inf_or_nan:
+ mov %r8, %r11
+ cmp .L__neg_inf(%rip), %r8
+ cmove .L__pos_inf(%rip), %rax
+ je .L__x_is_zero_z_is_inf
+
+ or .L__qnan_set(%rip), %r11
+ mov .L__mant_mask(%rip), %r10
+ and %r8, %r10
+ jnz .L__y_is_nan
+
+ movd %rax, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovz .L__pos_inf(%rip), %r11
+ mov %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ or .L__qnan_set(%rip), %rax
+ and %rdx, %r9
+ cmovnz %rax, %r11
+ jnz .L__x_is_nan
+
+ xor %rax, %rax
+ mov %r8, %r9
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r9
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ cmovnz %r9, %r11
+ jnz .L__y_is_nan
+
+ movd %r11, %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_small:
+ movsd .L__pos_one(%rip), %xmm0
+ addsd %xmm1, %xmm0
+ jmp .L__final_check
+
+
+.p2align 4,,15
+.L__ax_lt1_y_is_large_or_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovnz .L__pos_inf(%rip), %r11
+ jmp .L__adjust_for_nan
+
+.p2align 4,,15
+.L__ax_gt1_y_is_large_or_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovz .L__pos_inf(%rip), %r11
+
+.p2align 4,,15
+.L__adjust_for_nan:
+
+ xor %rax, %rax
+ mov %r8, %r9
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r9
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ cmovnz %r9, %r11
+ jnz .L__y_is_nan
+
+ test %rax, %rax
+ jnz .L__y_is_inf
+
+.p2align 4,,15
+.L__z_is_zero_or_inf:
+
+ mov .L__flag_z_zero(%rip), %edi
+ test %r11, %r11
+ cmovnz .L__flag_z_inf(%rip), %edi
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_inf:
+
+ movd %r11, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan:
+
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jnz .L__x_is_nan_y_is_nan
+
+ mov .L__qnan_set(%rip), %r9
+ and %rdx, %r9
+ movd %r11, %xmm0
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_x_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_nan:
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ movd %r11, %xmm0
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan_y_is_nan:
+
+ mov .L__qnan_set(%rip), %r9
+ and %rdx, %r9
+ jz .L__continue_xy_nan
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ jz .L__continue_xy_nan
+
+ movd %r11, %xmm0
+ jmp .L__final_check
+
+.L__continue_xy_nan:
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_x_nan_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__z_denormal:
+
+ movsd %xmm0, %xmm2
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ mov .L__flag_z_denormal(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+
+.data
+.align 16
+
+# these flag codes must match the ones in the corresponding .c file
+.L__flag_x_one_y_snan: .long 1
+.L__flag_x_zero_z_inf: .long 2
+.L__flag_x_nan: .long 3
+.L__flag_y_nan: .long 4
+.L__flag_x_nan_y_nan: .long 5
+.L__flag_x_neg_y_notint: .long 6
+.L__flag_z_zero: .long 7
+.L__flag_z_denormal: .long 8
+.L__flag_z_inf: .long 9
+
+.align 16
+
+.L__ay_max_bound: .quad 0x43e0000000000000
+.L__ay_min_bound: .quad 0x3c00000000000000
+.L__sign_mask: .quad 0x8000000000000000
+.L__sign_and_exp_mask: .quad 0x0fff0000000000000
+.L__exp_mask: .quad 0x7ff0000000000000
+.L__neg_inf: .quad 0x0fff0000000000000
+.L__pos_inf: .quad 0x7ff0000000000000
+.L__pos_one: .quad 0x3ff0000000000000
+.L__pos_zero: .quad 0x0000000000000000
+.L__exp_mant_mask: .quad 0x7fffffffffffffff
+.L__mant_mask: .quad 0x000fffffffffffff
+.L__ind_pattern: .quad 0x0fff8000000000000
+
+.L__neg_qnan: .quad 0x0fff8000000000000
+.L__qnan: .quad 0x7ff8000000000000
+.L__qnan_set: .quad 0x0008000000000000
+
+.L__neg_one: .quad 0x0bff0000000000000
+.L__neg_zero: .quad 0x8000000000000000
+
+.L__exp_shift: .quad 0x0000000000000034 # 52
+.L__exp_bias: .quad 0x00000000000003ff # 1023
+.L__exp_bias_m1: .quad 0x00000000000003fe # 1022
+
+.L__yexp_53: .quad 0x0000000000000035 # 53
+.L__mant_full: .quad 0x000fffffffffffff
+.L__1_before_mant: .quad 0x0010000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+.L__mask_mant9: .quad 0x0000080000000000
+
+.align 16
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000
+ .quad 0x0fffffffff8000000
+
+.L__mask_8000000000000000: .quad 0x8000000000000000
+ .quad 0x8000000000000000
+
+.L__real_4090040000000000: .quad 0x4090040000000000
+ .quad 0x4090040000000000
+
+.L__real_C090C80000000000: .quad 0x0C090C80000000000
+ .quad 0x0C090C80000000000
+
+#---------------------
+# log data
+#---------------------
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_nan: .quad 0x7ff8000000000000 # NaN
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+
+.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x0000000000000000
+.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_7: .quad 0x3fc2492492492494
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fc0000000000000 # 0.125
+ .quad 0x3fc0000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f6ff00aa0000000
+ .quad 0x3f7fe02a60000000
+ .quad 0x3f87dc4750000000
+ .quad 0x3f8fc0a8b0000000
+ .quad 0x3f93cea440000000
+ .quad 0x3f97b91b00000000
+ .quad 0x3f9b9fc020000000
+ .quad 0x3f9f829b00000000
+ .quad 0x3fa1b0d980000000
+ .quad 0x3fa39e87b0000000
+ .quad 0x3fa58a5ba0000000
+ .quad 0x3fa77458f0000000
+ .quad 0x3fa95c8300000000
+ .quad 0x3fab42dd70000000
+ .quad 0x3fad276b80000000
+ .quad 0x3faf0a30c0000000
+ .quad 0x3fb0759830000000
+ .quad 0x3fb16536e0000000
+ .quad 0x3fb253f620000000
+ .quad 0x3fb341d790000000
+ .quad 0x3fb42edcb0000000
+ .quad 0x3fb51b0730000000
+ .quad 0x3fb60658a0000000
+ .quad 0x3fb6f0d280000000
+ .quad 0x3fb7da7660000000
+ .quad 0x3fb8c345d0000000
+ .quad 0x3fb9ab4240000000
+ .quad 0x3fba926d30000000
+ .quad 0x3fbb78c820000000
+ .quad 0x3fbc5e5480000000
+ .quad 0x3fbd4313d0000000
+ .quad 0x3fbe270760000000
+ .quad 0x3fbf0a30c0000000
+ .quad 0x3fbfec9130000000
+ .quad 0x3fc0671510000000
+ .quad 0x3fc0d77e70000000
+ .quad 0x3fc1478580000000
+ .quad 0x3fc1b72ad0000000
+ .quad 0x3fc2266f10000000
+ .quad 0x3fc29552f0000000
+ .quad 0x3fc303d710000000
+ .quad 0x3fc371fc20000000
+ .quad 0x3fc3dfc2b0000000
+ .quad 0x3fc44d2b60000000
+ .quad 0x3fc4ba36f0000000
+ .quad 0x3fc526e5e0000000
+ .quad 0x3fc59338d0000000
+ .quad 0x3fc5ff3070000000
+ .quad 0x3fc66acd40000000
+ .quad 0x3fc6d60fe0000000
+ .quad 0x3fc740f8f0000000
+ .quad 0x3fc7ab8900000000
+ .quad 0x3fc815c0a0000000
+ .quad 0x3fc87fa060000000
+ .quad 0x3fc8e928d0000000
+ .quad 0x3fc9525a90000000
+ .quad 0x3fc9bb3620000000
+ .quad 0x3fca23bc10000000
+ .quad 0x3fca8becf0000000
+ .quad 0x3fcaf3c940000000
+ .quad 0x3fcb5b5190000000
+ .quad 0x3fcbc28670000000
+ .quad 0x3fcc296850000000
+ .quad 0x3fcc8ff7c0000000
+ .quad 0x3fccf63540000000
+ .quad 0x3fcd5c2160000000
+ .quad 0x3fcdc1bca0000000
+ .quad 0x3fce270760000000
+ .quad 0x3fce8c0250000000
+ .quad 0x3fcef0adc0000000
+ .quad 0x3fcf550a50000000
+ .quad 0x3fcfb91860000000
+ .quad 0x3fd00e6c40000000
+ .quad 0x3fd0402590000000
+ .quad 0x3fd071b850000000
+ .quad 0x3fd0a324e0000000
+ .quad 0x3fd0d46b50000000
+ .quad 0x3fd1058bf0000000
+ .quad 0x3fd1368700000000
+ .quad 0x3fd1675ca0000000
+ .quad 0x3fd1980d20000000
+ .quad 0x3fd1c898c0000000
+ .quad 0x3fd1f8ff90000000
+ .quad 0x3fd22941f0000000
+ .quad 0x3fd2596010000000
+ .quad 0x3fd2895a10000000
+ .quad 0x3fd2b93030000000
+ .quad 0x3fd2e8e2b0000000
+ .quad 0x3fd31871c0000000
+ .quad 0x3fd347dd90000000
+ .quad 0x3fd3772660000000
+ .quad 0x3fd3a64c50000000
+ .quad 0x3fd3d54fa0000000
+ .quad 0x3fd4043080000000
+ .quad 0x3fd432ef20000000
+ .quad 0x3fd4618bc0000000
+ .quad 0x3fd4900680000000
+ .quad 0x3fd4be5f90000000
+ .quad 0x3fd4ec9730000000
+ .quad 0x3fd51aad80000000
+ .quad 0x3fd548a2c0000000
+ .quad 0x3fd5767710000000
+ .quad 0x3fd5a42ab0000000
+ .quad 0x3fd5d1bdb0000000
+ .quad 0x3fd5ff3070000000
+ .quad 0x3fd62c82f0000000
+ .quad 0x3fd659b570000000
+ .quad 0x3fd686c810000000
+ .quad 0x3fd6b3bb20000000
+ .quad 0x3fd6e08ea0000000
+ .quad 0x3fd70d42e0000000
+ .quad 0x3fd739d7f0000000
+ .quad 0x3fd7664e10000000
+ .quad 0x3fd792a550000000
+ .quad 0x3fd7bede00000000
+ .quad 0x3fd7eaf830000000
+ .quad 0x3fd816f410000000
+ .quad 0x3fd842d1d0000000
+ .quad 0x3fd86e9190000000
+ .quad 0x3fd89a3380000000
+ .quad 0x3fd8c5b7c0000000
+ .quad 0x3fd8f11e80000000
+ .quad 0x3fd91c67e0000000
+ .quad 0x3fd9479410000000
+ .quad 0x3fd972a340000000
+ .quad 0x3fd99d9580000000
+ .quad 0x3fd9c86b00000000
+ .quad 0x3fd9f323e0000000
+ .quad 0x3fda1dc060000000
+ .quad 0x3fda484090000000
+ .quad 0x3fda72a490000000
+ .quad 0x3fda9cec90000000
+ .quad 0x3fdac718c0000000
+ .quad 0x3fdaf12930000000
+ .quad 0x3fdb1b1e00000000
+ .quad 0x3fdb44f770000000
+ .quad 0x3fdb6eb590000000
+ .quad 0x3fdb985890000000
+ .quad 0x3fdbc1e080000000
+ .quad 0x3fdbeb4d90000000
+ .quad 0x3fdc149ff0000000
+ .quad 0x3fdc3dd7a0000000
+ .quad 0x3fdc66f4e0000000
+ .quad 0x3fdc8ff7c0000000
+ .quad 0x3fdcb8e070000000
+ .quad 0x3fdce1af00000000
+ .quad 0x3fdd0a63a0000000
+ .quad 0x3fdd32fe70000000
+ .quad 0x3fdd5b7f90000000
+ .quad 0x3fdd83e720000000
+ .quad 0x3fddac3530000000
+ .quad 0x3fddd46a00000000
+ .quad 0x3fddfc8590000000
+ .quad 0x3fde248810000000
+ .quad 0x3fde4c71a0000000
+ .quad 0x3fde744260000000
+ .quad 0x3fde9bfa60000000
+ .quad 0x3fdec399d0000000
+ .quad 0x3fdeeb20c0000000
+ .quad 0x3fdf128f50000000
+ .quad 0x3fdf39e5b0000000
+ .quad 0x3fdf6123f0000000
+ .quad 0x3fdf884a30000000
+ .quad 0x3fdfaf5880000000
+ .quad 0x3fdfd64f20000000
+ .quad 0x3fdffd2e00000000
+ .quad 0x3fe011fab0000000
+ .quad 0x3fe02552a0000000
+ .quad 0x3fe0389ee0000000
+ .quad 0x3fe04bdf90000000
+ .quad 0x3fe05f14b0000000
+ .quad 0x3fe0723e50000000
+ .quad 0x3fe0855c80000000
+ .quad 0x3fe0986f40000000
+ .quad 0x3fe0ab76b0000000
+ .quad 0x3fe0be72e0000000
+ .quad 0x3fe0d163c0000000
+ .quad 0x3fe0e44980000000
+ .quad 0x3fe0f72410000000
+ .quad 0x3fe109f390000000
+ .quad 0x3fe11cb810000000
+ .quad 0x3fe12f7190000000
+ .quad 0x3fe1422020000000
+ .quad 0x3fe154c3d0000000
+ .quad 0x3fe1675ca0000000
+ .quad 0x3fe179eab0000000
+ .quad 0x3fe18c6e00000000
+ .quad 0x3fe19ee6b0000000
+ .quad 0x3fe1b154b0000000
+ .quad 0x3fe1c3b810000000
+ .quad 0x3fe1d610f0000000
+ .quad 0x3fe1e85f50000000
+ .quad 0x3fe1faa340000000
+ .quad 0x3fe20cdcd0000000
+ .quad 0x3fe21f0bf0000000
+ .quad 0x3fe23130d0000000
+ .quad 0x3fe2434b60000000
+ .quad 0x3fe2555bc0000000
+ .quad 0x3fe2676200000000
+ .quad 0x3fe2795e10000000
+ .quad 0x3fe28b5000000000
+ .quad 0x3fe29d37f0000000
+ .quad 0x3fe2af15f0000000
+ .quad 0x3fe2c0e9e0000000
+ .quad 0x3fe2d2b400000000
+ .quad 0x3fe2e47430000000
+ .quad 0x3fe2f62a90000000
+ .quad 0x3fe307d730000000
+ .quad 0x3fe3197a00000000
+ .quad 0x3fe32b1330000000
+ .quad 0x3fe33ca2b0000000
+ .quad 0x3fe34e2890000000
+ .quad 0x3fe35fa4e0000000
+ .quad 0x3fe37117b0000000
+ .quad 0x3fe38280f0000000
+ .quad 0x3fe393e0d0000000
+ .quad 0x3fe3a53730000000
+ .quad 0x3fe3b68440000000
+ .quad 0x3fe3c7c7f0000000
+ .quad 0x3fe3d90260000000
+ .quad 0x3fe3ea3390000000
+ .quad 0x3fe3fb5b80000000
+ .quad 0x3fe40c7a40000000
+ .quad 0x3fe41d8fe0000000
+ .quad 0x3fe42e9c60000000
+ .quad 0x3fe43f9fe0000000
+ .quad 0x3fe4509a50000000
+ .quad 0x3fe4618bc0000000
+ .quad 0x3fe4727430000000
+ .quad 0x3fe48353d0000000
+ .quad 0x3fe4942a80000000
+ .quad 0x3fe4a4f850000000
+ .quad 0x3fe4b5bd60000000
+ .quad 0x3fe4c679a0000000
+ .quad 0x3fe4d72d30000000
+ .quad 0x3fe4e7d810000000
+ .quad 0x3fe4f87a30000000
+ .quad 0x3fe50913c0000000
+ .quad 0x3fe519a4c0000000
+ .quad 0x3fe52a2d20000000
+ .quad 0x3fe53aad00000000
+ .quad 0x3fe54b2460000000
+ .quad 0x3fe55b9350000000
+ .quad 0x3fe56bf9d0000000
+ .quad 0x3fe57c57f0000000
+ .quad 0x3fe58cadb0000000
+ .quad 0x3fe59cfb20000000
+ .quad 0x3fe5ad4040000000
+ .quad 0x3fe5bd7d30000000
+ .quad 0x3fe5cdb1d0000000
+ .quad 0x3fe5ddde50000000
+ .quad 0x3fe5ee02a0000000
+ .quad 0x3fe5fe1ed0000000
+ .quad 0x3fe60e32f0000000
+ .quad 0x3fe61e3ef0000000
+ .quad 0x3fe62e42e0000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db5885e0250435a
+ .quad 0x3de620cf11f86ed2
+ .quad 0x3dff0214edba4a25
+ .quad 0x3dbf807c79f3db4e
+ .quad 0x3dea352ba779a52b
+ .quad 0x3dff56c46aa49fd5
+ .quad 0x3dfebe465fef5196
+ .quad 0x3e0cf0660099f1f8
+ .quad 0x3e1247b2ff85945d
+ .quad 0x3e13fd7abf5202b6
+ .quad 0x3e1f91c9a918d51e
+ .quad 0x3e08cb73f118d3ca
+ .quad 0x3e1d91c7d6fad074
+ .quad 0x3de1971bec28d14c
+ .quad 0x3e15b616a423c78a
+ .quad 0x3da162a6617cc971
+ .quad 0x3e166391c4c06d29
+ .quad 0x3e2d46f5c1d0c4b8
+ .quad 0x3e2e14282df1f6d3
+ .quad 0x3e186f47424a660d
+ .quad 0x3e2d4c8de077753e
+ .quad 0x3e2e0c307ed24f1c
+ .quad 0x3e226ea18763bdd3
+ .quad 0x3e25cad69737c933
+ .quad 0x3e2af62599088901
+ .quad 0x3e18c66c83d6b2d0
+ .quad 0x3e1880ceb36fb30f
+ .quad 0x3e2495aac6ca17a4
+ .quad 0x3e2761db4210878c
+ .quad 0x3e2eb78e862bac2f
+ .quad 0x3e19b2cd75790dd9
+ .quad 0x3e2c55e5cbd3d50f
+ .quad 0x3db162a6617cc971
+ .quad 0x3dfdbeabaaa2e519
+ .quad 0x3e1652cb7150c647
+ .quad 0x3e39a11cb2cd2ee2
+ .quad 0x3e219d0ab1a28813
+ .quad 0x3e24bd9e80a41811
+ .quad 0x3e3214b596faa3df
+ .quad 0x3e303fea46980bb8
+ .quad 0x3e31c8ffa5fd28c7
+ .quad 0x3dce8f743bcd96c5
+ .quad 0x3dfd98c5395315c6
+ .quad 0x3e3996fa3ccfa7b2
+ .quad 0x3e1cd2af2ad13037
+ .quad 0x3e1d0da1bd17200e
+ .quad 0x3e3330410ba68b75
+ .quad 0x3df4f27a790e7c41
+ .quad 0x3e13956a86f6ff1b
+ .quad 0x3e2c6748723551d9
+ .quad 0x3e2500de9326cdfc
+ .quad 0x3e1086c848df1b59
+ .quad 0x3e04357ead6836ff
+ .quad 0x3e24832442408024
+ .quad 0x3e3d10da8154b13d
+ .quad 0x3e39e8ad68ec8260
+ .quad 0x3e3cfbf706abaf18
+ .quad 0x3e3fc56ac6326e23
+ .quad 0x3e39105e3185cf21
+ .quad 0x3e3d017fe5b19cc0
+ .quad 0x3e3d1f6b48dd13fe
+ .quad 0x3e20b63358a7e73a
+ .quad 0x3e263063028c211c
+ .quad 0x3e2e6a6886b09760
+ .quad 0x3e3c138bb891cd03
+ .quad 0x3e369f7722b7221a
+ .quad 0x3df57d8fac1a628c
+ .quad 0x3e3c55e5cbd3d50f
+ .quad 0x3e1552d2ff48fe2e
+ .quad 0x3e37b8b26ca431bc
+ .quad 0x3e292decdc1c5f6d
+ .quad 0x3e3abc7c551aaa8c
+ .quad 0x3e36b540731a354b
+ .quad 0x3e32d341036b89ef
+ .quad 0x3e4f9ab21a3a2e0f
+ .quad 0x3e239c871afb9fbd
+ .quad 0x3e3e6add2c81f640
+ .quad 0x3e435c95aa313f41
+ .quad 0x3e249d4582f6cc53
+ .quad 0x3e47574c1c07398f
+ .quad 0x3e4ba846dece9e8d
+ .quad 0x3e16999fafbc68e7
+ .quad 0x3e4c9145e51b0103
+ .quad 0x3e479ef2cb44850a
+ .quad 0x3e0beec73de11275
+ .quad 0x3e2ef4351af5a498
+ .quad 0x3e45713a493b4a50
+ .quad 0x3e45c23a61385992
+ .quad 0x3e42a88309f57299
+ .quad 0x3e4530faa9ac8ace
+ .quad 0x3e25fec2d792a758
+ .quad 0x3e35a517a71cbcd7
+ .quad 0x3e3707dc3e1cd9a3
+ .quad 0x3e3a1a9f8ef43049
+ .quad 0x3e4409d0276b3674
+ .quad 0x3e20e2f613e85bd9
+ .quad 0x3df0027433001e5f
+ .quad 0x3e35dde2836d3265
+ .quad 0x3e2300134d7aaf04
+ .quad 0x3e3cb7e0b42724f5
+ .quad 0x3e2d6e93167e6308
+ .quad 0x3e3d1569b1526adb
+ .quad 0x3e0e99fc338a1a41
+ .quad 0x3e4eb01394a11b1c
+ .quad 0x3e04f27a790e7c41
+ .quad 0x3e25ce3ca97b7af9
+ .quad 0x3e281f0f940ed857
+ .quad 0x3e4d36295d88857c
+ .quad 0x3e21aca1ec4af526
+ .quad 0x3e445743c7182726
+ .quad 0x3e23c491aead337e
+ .quad 0x3e3aef401a738931
+ .quad 0x3e21cede76092a29
+ .quad 0x3e4fba8f44f82bb4
+ .quad 0x3e446f5f7f3c3e1a
+ .quad 0x3e47055f86c9674b
+ .quad 0x3e4b41a92b6b6e1a
+ .quad 0x3e443d162e927628
+ .quad 0x3e4466174013f9b1
+ .quad 0x3e3b05096ad69c62
+ .quad 0x3e40b169150faa58
+ .quad 0x3e3cd98b1df85da7
+ .quad 0x3e468b507b0f8fa8
+ .quad 0x3e48422df57499ba
+ .quad 0x3e11351586970274
+ .quad 0x3e117e08acba92ee
+ .quad 0x3e26e04314dd0229
+ .quad 0x3e497f3097e56d1a
+ .quad 0x3e3356e655901286
+ .quad 0x3e0cb761457f94d6
+ .quad 0x3e39af67a85a9dac
+ .quad 0x3e453410931a909f
+ .quad 0x3e22c587206058f5
+ .quad 0x3e223bc358899c22
+ .quad 0x3e4d7bf8b6d223cb
+ .quad 0x3e47991ec5197ddb
+ .quad 0x3e4a79e6bb3a9219
+ .quad 0x3e3a4c43ed663ec5
+ .quad 0x3e461b5a1484f438
+ .quad 0x3e4b4e36f7ef0c3a
+ .quad 0x3e115f026acd0d1b
+ .quad 0x3e3f36b535cecf05
+ .quad 0x3e2ffb7fbf3eb5c6
+ .quad 0x3e3e6a6886b09760
+ .quad 0x3e3135eb27f5bbc3
+ .quad 0x3e470be7d6f6fa57
+ .quad 0x3e4ce43cc84ab338
+ .quad 0x3e4c01d7aac3bd91
+ .quad 0x3e45c58d07961060
+ .quad 0x3e3628bcf941456e
+ .quad 0x3e4c58b2a8461cd2
+ .quad 0x3e33071282fb989a
+ .quad 0x3e420dab6a80f09c
+ .quad 0x3e44f8d84c397b1e
+ .quad 0x3e40d0ee08599e48
+ .quad 0x3e1d68787e37da36
+ .quad 0x3e366187d591bafc
+ .quad 0x3e22346600bae772
+ .quad 0x3e390377d0d61b8e
+ .quad 0x3e4f5e0dd966b907
+ .quad 0x3e49023cb79a00e2
+ .quad 0x3e44e05158c28ad8
+ .quad 0x3e3bfa7b08b18ae4
+ .quad 0x3e4ef1e63db35f67
+ .quad 0x3e0ec2ae39493d4f
+ .quad 0x3e40afe930ab2fa0
+ .quad 0x3e225ff8a1810dd4
+ .quad 0x3e469743fb1a71a5
+ .quad 0x3e5f9cc676785571
+ .quad 0x3e5b524da4cbf982
+ .quad 0x3e5a4c8b381535b8
+ .quad 0x3e5839be809caf2c
+ .quad 0x3e50968a1cb82c13
+ .quad 0x3e5eae6a41723fb5
+ .quad 0x3e5d9c29a380a4db
+ .quad 0x3e4094aa0ada625e
+ .quad 0x3e5973ad6fc108ca
+ .quad 0x3e4747322fdbab97
+ .quad 0x3e593692fa9d4221
+ .quad 0x3e5c5a992dfbc7d9
+ .quad 0x3e4e1f33e102387a
+ .quad 0x3e464fbef14c048c
+ .quad 0x3e4490f513ca5e3b
+ .quad 0x3e37a6af4d4c799d
+ .quad 0x3e57574c1c07398f
+ .quad 0x3e57b133417f8c1c
+ .quad 0x3e5feb9e0c176514
+ .quad 0x3e419f25bb3172f7
+ .quad 0x3e45f68a7bbfb852
+ .quad 0x3e5ee278497929f1
+ .quad 0x3e5ccee006109d58
+ .quad 0x3e5ce081a07bd8b3
+ .quad 0x3e570e12981817b8
+ .quad 0x3e292ab6d93503d0
+ .quad 0x3e58cb7dd7c3b61e
+ .quad 0x3e4efafd0a0b78da
+ .quad 0x3e5e907267c4288e
+ .quad 0x3e5d31ef96780875
+ .quad 0x3e23430dfcd2ad50
+ .quad 0x3e344d88d75bc1f9
+ .quad 0x3e5bec0f055e04fc
+ .quad 0x3e5d85611590b9ad
+ .quad 0x3df320568e583229
+ .quad 0x3e5a891d1772f538
+ .quad 0x3e22edc9dabba74d
+ .quad 0x3e4b9009a1015086
+ .quad 0x3e52a12a8c5b1a19
+ .quad 0x3e3a7885f0fdac85
+ .quad 0x3e5f4ffcd43ac691
+ .quad 0x3e52243ae2640aad
+ .quad 0x3e546513299035d3
+ .quad 0x3e5b39c3a62dd725
+ .quad 0x3e5ba6dd40049f51
+ .quad 0x3e451d1ed7177409
+ .quad 0x3e5cb0f2fd7f5216
+ .quad 0x3e3ab150cd4e2213
+ .quad 0x3e5cfd7bf3193844
+ .quad 0x3e53fff8455f1dbd
+ .quad 0x3e5fee640b905fc9
+ .quad 0x3e54e2adf548084c
+ .quad 0x3e3b597adc1ecdd2
+ .quad 0x3e4345bd096d3a75
+ .quad 0x3e5101b9d2453c8b
+ .quad 0x3e508ce55cc8c979
+ .quad 0x3e5bbf017e595f71
+ .quad 0x3e37ce733bd393dc
+ .quad 0x3e233bb0a503f8a1
+ .quad 0x3e30e2f613e85bd9
+ .quad 0x3e5e67555a635b3c
+ .quad 0x3e2ea88df73d5e8b
+ .quad 0x3e3d17e03bda18a8
+ .quad 0x3e5b607d76044f7e
+ .quad 0x3e52adc4e71bc2fc
+ .quad 0x3e5f99dc7362d1d9
+ .quad 0x3e5473fa008e6a6a
+ .quad 0x3e2b75bb09cb0985
+ .quad 0x3e5ea04dd10b9aba
+ .quad 0x3e5802d0d6979674
+ .quad 0x3e174688ccd99094
+ .quad 0x3e496f16abb9df22
+ .quad 0x3e46e66df2aa374f
+ .quad 0x3e4e66525ea4550a
+ .quad 0x3e42d02f34f20cbd
+ .quad 0x3e46cfce65047188
+ .quad 0x3e39b78c842d58b8
+ .quad 0x3e4735e624c24bc9
+ .quad 0x3e47eba1f7dd1adf
+ .quad 0x3e586b3e59f65355
+ .quad 0x3e1ce38e637f1b4d
+ .quad 0x3e58d82ec919edc7
+ .quad 0x3e4c52648ddcfa37
+ .quad 0x3e52482ceae1ac12
+ .quad 0x3e55a312311aba4f
+ .quad 0x3e411e236329f225
+ .quad 0x3e5b48c8cd2f246c
+ .quad 0x3e6efa39ef35793c
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv_head:
+ .quad 0x4000000000000000
+ .quad 0x3fffe00000000000
+ .quad 0x3fffc00000000000
+ .quad 0x3fffa00000000000
+ .quad 0x3fff800000000000
+ .quad 0x3fff600000000000
+ .quad 0x3fff400000000000
+ .quad 0x3fff200000000000
+ .quad 0x3fff000000000000
+ .quad 0x3ffee00000000000
+ .quad 0x3ffec00000000000
+ .quad 0x3ffea00000000000
+ .quad 0x3ffe900000000000
+ .quad 0x3ffe700000000000
+ .quad 0x3ffe500000000000
+ .quad 0x3ffe300000000000
+ .quad 0x3ffe100000000000
+ .quad 0x3ffe000000000000
+ .quad 0x3ffde00000000000
+ .quad 0x3ffdc00000000000
+ .quad 0x3ffda00000000000
+ .quad 0x3ffd900000000000
+ .quad 0x3ffd700000000000
+ .quad 0x3ffd500000000000
+ .quad 0x3ffd400000000000
+ .quad 0x3ffd200000000000
+ .quad 0x3ffd000000000000
+ .quad 0x3ffcf00000000000
+ .quad 0x3ffcd00000000000
+ .quad 0x3ffcb00000000000
+ .quad 0x3ffca00000000000
+ .quad 0x3ffc800000000000
+ .quad 0x3ffc700000000000
+ .quad 0x3ffc500000000000
+ .quad 0x3ffc300000000000
+ .quad 0x3ffc200000000000
+ .quad 0x3ffc000000000000
+ .quad 0x3ffbf00000000000
+ .quad 0x3ffbd00000000000
+ .quad 0x3ffbc00000000000
+ .quad 0x3ffba00000000000
+ .quad 0x3ffb900000000000
+ .quad 0x3ffb700000000000
+ .quad 0x3ffb600000000000
+ .quad 0x3ffb400000000000
+ .quad 0x3ffb300000000000
+ .quad 0x3ffb200000000000
+ .quad 0x3ffb000000000000
+ .quad 0x3ffaf00000000000
+ .quad 0x3ffad00000000000
+ .quad 0x3ffac00000000000
+ .quad 0x3ffaa00000000000
+ .quad 0x3ffa900000000000
+ .quad 0x3ffa800000000000
+ .quad 0x3ffa600000000000
+ .quad 0x3ffa500000000000
+ .quad 0x3ffa400000000000
+ .quad 0x3ffa200000000000
+ .quad 0x3ffa100000000000
+ .quad 0x3ffa000000000000
+ .quad 0x3ff9e00000000000
+ .quad 0x3ff9d00000000000
+ .quad 0x3ff9c00000000000
+ .quad 0x3ff9a00000000000
+ .quad 0x3ff9900000000000
+ .quad 0x3ff9800000000000
+ .quad 0x3ff9700000000000
+ .quad 0x3ff9500000000000
+ .quad 0x3ff9400000000000
+ .quad 0x3ff9300000000000
+ .quad 0x3ff9200000000000
+ .quad 0x3ff9000000000000
+ .quad 0x3ff8f00000000000
+ .quad 0x3ff8e00000000000
+ .quad 0x3ff8d00000000000
+ .quad 0x3ff8b00000000000
+ .quad 0x3ff8a00000000000
+ .quad 0x3ff8900000000000
+ .quad 0x3ff8800000000000
+ .quad 0x3ff8700000000000
+ .quad 0x3ff8600000000000
+ .quad 0x3ff8400000000000
+ .quad 0x3ff8300000000000
+ .quad 0x3ff8200000000000
+ .quad 0x3ff8100000000000
+ .quad 0x3ff8000000000000
+ .quad 0x3ff7f00000000000
+ .quad 0x3ff7e00000000000
+ .quad 0x3ff7d00000000000
+ .quad 0x3ff7b00000000000
+ .quad 0x3ff7a00000000000
+ .quad 0x3ff7900000000000
+ .quad 0x3ff7800000000000
+ .quad 0x3ff7700000000000
+ .quad 0x3ff7600000000000
+ .quad 0x3ff7500000000000
+ .quad 0x3ff7400000000000
+ .quad 0x3ff7300000000000
+ .quad 0x3ff7200000000000
+ .quad 0x3ff7100000000000
+ .quad 0x3ff7000000000000
+ .quad 0x3ff6f00000000000
+ .quad 0x3ff6e00000000000
+ .quad 0x3ff6d00000000000
+ .quad 0x3ff6c00000000000
+ .quad 0x3ff6b00000000000
+ .quad 0x3ff6a00000000000
+ .quad 0x3ff6900000000000
+ .quad 0x3ff6800000000000
+ .quad 0x3ff6700000000000
+ .quad 0x3ff6600000000000
+ .quad 0x3ff6500000000000
+ .quad 0x3ff6400000000000
+ .quad 0x3ff6300000000000
+ .quad 0x3ff6200000000000
+ .quad 0x3ff6100000000000
+ .quad 0x3ff6000000000000
+ .quad 0x3ff5f00000000000
+ .quad 0x3ff5e00000000000
+ .quad 0x3ff5d00000000000
+ .quad 0x3ff5c00000000000
+ .quad 0x3ff5b00000000000
+ .quad 0x3ff5a00000000000
+ .quad 0x3ff5900000000000
+ .quad 0x3ff5800000000000
+ .quad 0x3ff5800000000000
+ .quad 0x3ff5700000000000
+ .quad 0x3ff5600000000000
+ .quad 0x3ff5500000000000
+ .quad 0x3ff5400000000000
+ .quad 0x3ff5300000000000
+ .quad 0x3ff5200000000000
+ .quad 0x3ff5100000000000
+ .quad 0x3ff5000000000000
+ .quad 0x3ff5000000000000
+ .quad 0x3ff4f00000000000
+ .quad 0x3ff4e00000000000
+ .quad 0x3ff4d00000000000
+ .quad 0x3ff4c00000000000
+ .quad 0x3ff4b00000000000
+ .quad 0x3ff4a00000000000
+ .quad 0x3ff4a00000000000
+ .quad 0x3ff4900000000000
+ .quad 0x3ff4800000000000
+ .quad 0x3ff4700000000000
+ .quad 0x3ff4600000000000
+ .quad 0x3ff4600000000000
+ .quad 0x3ff4500000000000
+ .quad 0x3ff4400000000000
+ .quad 0x3ff4300000000000
+ .quad 0x3ff4200000000000
+ .quad 0x3ff4200000000000
+ .quad 0x3ff4100000000000
+ .quad 0x3ff4000000000000
+ .quad 0x3ff3f00000000000
+ .quad 0x3ff3e00000000000
+ .quad 0x3ff3e00000000000
+ .quad 0x3ff3d00000000000
+ .quad 0x3ff3c00000000000
+ .quad 0x3ff3b00000000000
+ .quad 0x3ff3b00000000000
+ .quad 0x3ff3a00000000000
+ .quad 0x3ff3900000000000
+ .quad 0x3ff3800000000000
+ .quad 0x3ff3800000000000
+ .quad 0x3ff3700000000000
+ .quad 0x3ff3600000000000
+ .quad 0x3ff3500000000000
+ .quad 0x3ff3500000000000
+ .quad 0x3ff3400000000000
+ .quad 0x3ff3300000000000
+ .quad 0x3ff3200000000000
+ .quad 0x3ff3200000000000
+ .quad 0x3ff3100000000000
+ .quad 0x3ff3000000000000
+ .quad 0x3ff3000000000000
+ .quad 0x3ff2f00000000000
+ .quad 0x3ff2e00000000000
+ .quad 0x3ff2e00000000000
+ .quad 0x3ff2d00000000000
+ .quad 0x3ff2c00000000000
+ .quad 0x3ff2b00000000000
+ .quad 0x3ff2b00000000000
+ .quad 0x3ff2a00000000000
+ .quad 0x3ff2900000000000
+ .quad 0x3ff2900000000000
+ .quad 0x3ff2800000000000
+ .quad 0x3ff2700000000000
+ .quad 0x3ff2700000000000
+ .quad 0x3ff2600000000000
+ .quad 0x3ff2500000000000
+ .quad 0x3ff2500000000000
+ .quad 0x3ff2400000000000
+ .quad 0x3ff2300000000000
+ .quad 0x3ff2300000000000
+ .quad 0x3ff2200000000000
+ .quad 0x3ff2100000000000
+ .quad 0x3ff2100000000000
+ .quad 0x3ff2000000000000
+ .quad 0x3ff2000000000000
+ .quad 0x3ff1f00000000000
+ .quad 0x3ff1e00000000000
+ .quad 0x3ff1e00000000000
+ .quad 0x3ff1d00000000000
+ .quad 0x3ff1c00000000000
+ .quad 0x3ff1c00000000000
+ .quad 0x3ff1b00000000000
+ .quad 0x3ff1b00000000000
+ .quad 0x3ff1a00000000000
+ .quad 0x3ff1900000000000
+ .quad 0x3ff1900000000000
+ .quad 0x3ff1800000000000
+ .quad 0x3ff1800000000000
+ .quad 0x3ff1700000000000
+ .quad 0x3ff1600000000000
+ .quad 0x3ff1600000000000
+ .quad 0x3ff1500000000000
+ .quad 0x3ff1500000000000
+ .quad 0x3ff1400000000000
+ .quad 0x3ff1300000000000
+ .quad 0x3ff1300000000000
+ .quad 0x3ff1200000000000
+ .quad 0x3ff1200000000000
+ .quad 0x3ff1100000000000
+ .quad 0x3ff1100000000000
+ .quad 0x3ff1000000000000
+ .quad 0x3ff0f00000000000
+ .quad 0x3ff0f00000000000
+ .quad 0x3ff0e00000000000
+ .quad 0x3ff0e00000000000
+ .quad 0x3ff0d00000000000
+ .quad 0x3ff0d00000000000
+ .quad 0x3ff0c00000000000
+ .quad 0x3ff0c00000000000
+ .quad 0x3ff0b00000000000
+ .quad 0x3ff0a00000000000
+ .quad 0x3ff0a00000000000
+ .quad 0x3ff0900000000000
+ .quad 0x3ff0900000000000
+ .quad 0x3ff0800000000000
+ .quad 0x3ff0800000000000
+ .quad 0x3ff0700000000000
+ .quad 0x3ff0700000000000
+ .quad 0x3ff0600000000000
+ .quad 0x3ff0600000000000
+ .quad 0x3ff0500000000000
+ .quad 0x3ff0500000000000
+ .quad 0x3ff0400000000000
+ .quad 0x3ff0400000000000
+ .quad 0x3ff0300000000000
+ .quad 0x3ff0300000000000
+ .quad 0x3ff0200000000000
+ .quad 0x3ff0200000000000
+ .quad 0x3ff0100000000000
+ .quad 0x3ff0100000000000
+ .quad 0x3ff0000000000000
+ .quad 0x3ff0000000000000
+
+.align 16
+.L__log_F_inv_tail:
+ .quad 0x0000000000000000
+ .quad 0x3effe01fe01fe020
+ .quad 0x3f1fc07f01fc07f0
+ .quad 0x3f31caa01fa11caa
+ .quad 0x3f3f81f81f81f820
+ .quad 0x3f48856506ddaba6
+ .quad 0x3f5196792909c560
+ .quad 0x3f57d9108c2ad433
+ .quad 0x3f5f07c1f07c1f08
+ .quad 0x3f638ff08b1c03dd
+ .quad 0x3f680f6603d980f6
+ .quad 0x3f6d00f57403d5d0
+ .quad 0x3f331abf0b7672a0
+ .quad 0x3f506a965d43919b
+ .quad 0x3f5ceb240795ceb2
+ .quad 0x3f6522f3b834e67f
+ .quad 0x3f6c3c3c3c3c3c3c
+ .quad 0x3f3e01e01e01e01e
+ .quad 0x3f575b8fe21a291c
+ .quad 0x3f6403b9403b9404
+ .quad 0x3f6cc0ed7303b5cc
+ .quad 0x3f479118f3fc4da2
+ .quad 0x3f5ed952e0b0ce46
+ .quad 0x3f695900eae56404
+ .quad 0x3f3d41d41d41d41d
+ .quad 0x3f5cb28ff16c69ae
+ .quad 0x3f696b1edd80e866
+ .quad 0x3f4372e225fe30d9
+ .quad 0x3f60ad12073615a2
+ .quad 0x3f6cdb2c0397cdb3
+ .quad 0x3f52cc157b864407
+ .quad 0x3f664cb5f7148404
+ .quad 0x3f3c71c71c71c71c
+ .quad 0x3f6129a21a930b84
+ .quad 0x3f6f1e0387f1e038
+ .quad 0x3f5ad4e4ba80709b
+ .quad 0x3f6c0e070381c0e0
+ .quad 0x3f560fba1a362bb0
+ .quad 0x3f6a5713280dee96
+ .quad 0x3f53f59620f9ece9
+ .quad 0x3f69f22983759f23
+ .quad 0x3f5478ac63fc8d5c
+ .quad 0x3f6ad87bb4671656
+ .quad 0x3f578b8efbb8148c
+ .quad 0x3f6d0369d0369d03
+ .quad 0x3f5d212b601b3748
+ .quad 0x3f0b2036406c80d9
+ .quad 0x3f629663b24547d1
+ .quad 0x3f4435e50d79435e
+ .quad 0x3f67d0ff2920bc03
+ .quad 0x3f55c06b15c06b16
+ .quad 0x3f6e3a5f0fd7f954
+ .quad 0x3f61dec0d4c77b03
+ .quad 0x3f473289870ac52e
+ .quad 0x3f6a034da034da03
+ .quad 0x3f5d041da2292856
+ .quad 0x3f3a41a41a41a41a
+ .quad 0x3f68550f8a39409d
+ .quad 0x3f5b4fe5e92c0686
+ .quad 0x3f3a01a01a01a01a
+ .quad 0x3f691d2a2067b23a
+ .quad 0x3f5e7c5dada0b4e5
+ .quad 0x3f468a7725080ce1
+ .quad 0x3f6c49d4aa21b490
+ .quad 0x3f63333333333333
+ .quad 0x3f54bc363b03fccf
+ .quad 0x3f2c9f01970e4f81
+ .quad 0x3f697617c6ef5b25
+ .quad 0x3f6161f9add3c0ca
+ .quad 0x3f5319fe6cb39806
+ .quad 0x3f2f693a1c451ab3
+ .quad 0x3f6a9e240321a9e2
+ .quad 0x3f63831f3831f383
+ .quad 0x3f5949ebc4dcfc1c
+ .quad 0x3f480c6980c6980c
+ .quad 0x3f6f9d00c5fe7403
+ .quad 0x3f69721ed7e75347
+ .quad 0x3f6381ec0313381f
+ .quad 0x3f5b97c2aec12653
+ .quad 0x3f509ef3024ae3ba
+ .quad 0x3f38618618618618
+ .quad 0x3f6e0184f00c2780
+ .quad 0x3f692ef5657dba52
+ .quad 0x3f64940305494030
+ .quad 0x3f60303030303030
+ .quad 0x3f58060180601806
+ .quad 0x3f5017f405fd017f
+ .quad 0x3f412a8ad278e8dd
+ .quad 0x3f17d05f417d05f4
+ .quad 0x3f6d67245c02f7d6
+ .quad 0x3f6a4411c1d986a9
+ .quad 0x3f6754d76c7316df
+ .quad 0x3f649902f149902f
+ .quad 0x3f621023358c1a68
+ .quad 0x3f5f7390d2a6c406
+ .quad 0x3f5b2b0805d5b2b1
+ .quad 0x3f5745d1745d1746
+ .quad 0x3f53c31507fa32c4
+ .quad 0x3f50a1fd1b7af017
+ .quad 0x3f4bc36ce3e0453a
+ .quad 0x3f4702e05c0b8170
+ .quad 0x3f4300b79300b793
+ .quad 0x3f3f76b4337c6cb1
+ .quad 0x3f3a62681c860fb0
+ .quad 0x3f36c16c16c16c17
+ .quad 0x3f3490aa31a3cfc7
+ .quad 0x3f33cd153729043e
+ .quad 0x3f3473a88d0bfd2e
+ .quad 0x3f36816816816817
+ .quad 0x3f39f36016719f36
+ .quad 0x3f3ec6a5122f9016
+ .quad 0x3f427c29da5519cf
+ .quad 0x3f4642c8590b2164
+ .quad 0x3f4ab5c45606f00b
+ .quad 0x3f4fd3b80b11fd3c
+ .quad 0x3f52cda0c6ba4eaa
+ .quad 0x3f56058160581606
+ .quad 0x3f5990d0a4b7ef87
+ .quad 0x3f5d6ee340579d6f
+ .quad 0x3f60cf87d9c54a69
+ .quad 0x3f6310572620ae4c
+ .quad 0x3f65798c8ff522a2
+ .quad 0x3f680ad602b580ad
+ .quad 0x3f6ac3e24799546f
+ .quad 0x3f6da46102b1da46
+ .quad 0x3f15805601580560
+ .quad 0x3f3ed3c506b39a23
+ .quad 0x3f4cbdd3e2970f60
+ .quad 0x3f55555555555555
+ .quad 0x3f5c979aee0bf805
+ .quad 0x3f621291e81fd58e
+ .quad 0x3f65fead500a9580
+ .quad 0x3f6a0fd5c5f02a3a
+ .quad 0x3f6e45c223898adc
+ .quad 0x3f35015015015015
+ .quad 0x3f4c7b16ea64d422
+ .quad 0x3f57829cbc14e5e1
+ .quad 0x3f60877db8589720
+ .quad 0x3f65710e4b5edcea
+ .quad 0x3f6a7dbb4d1fc1c8
+ .quad 0x3f6fad40a57eb503
+ .quad 0x3f43fd6bb00a5140
+ .quad 0x3f54e78ecb419ba9
+ .quad 0x3f600a44029100a4
+ .quad 0x3f65c28f5c28f5c3
+ .quad 0x3f6b9c68b2c0cc4a
+ .quad 0x3f2978feb9f34381
+ .quad 0x3f4ecf163bb6500a
+ .quad 0x3f5be1958b67ebb9
+ .quad 0x3f644e6157dc9a3b
+ .quad 0x3f6acc4baa3f0ddf
+ .quad 0x3f26a4cbcb2a247b
+ .quad 0x3f50505050505050
+ .quad 0x3f5e0b4439959819
+ .quad 0x3f66027f6027f602
+ .quad 0x3f6d1e854b5e0db4
+ .quad 0x3f4165e7254813e2
+ .quad 0x3f576646a9d716ef
+ .quad 0x3f632b48f757ce88
+ .quad 0x3f6ac1b24652a906
+ .quad 0x3f33b13b13b13b14
+ .quad 0x3f5490e1eb208984
+ .quad 0x3f62385830fec66e
+ .quad 0x3f6a45a6cc111b7e
+ .quad 0x3f33813813813814
+ .quad 0x3f556f472517b708
+ .quad 0x3f631be7bc0e8f2a
+ .quad 0x3f6b9cbf3e55f044
+ .quad 0x3f40e7d95bc609a9
+ .quad 0x3f59e6b3804d19e7
+ .quad 0x3f65c8b6af7963c2
+ .quad 0x3f6eb9dad43bf402
+ .quad 0x3f4f1a515885fb37
+ .quad 0x3f60eeb1d3d76c02
+ .quad 0x3f6a320261a32026
+ .quad 0x3f3c82ac40260390
+ .quad 0x3f5a12f684bda12f
+ .quad 0x3f669d43fda2962c
+ .quad 0x3f02e025c04b8097
+ .quad 0x3f542804b542804b
+ .quad 0x3f63f69b02593f6a
+ .quad 0x3f6df31cb46e21fa
+ .quad 0x3f5012b404ad012b
+ .quad 0x3f623925e7820a7f
+ .quad 0x3f6c8253c8253c82
+ .quad 0x3f4b92ddc02526e5
+ .quad 0x3f61602511602511
+ .quad 0x3f6bf471439c9adf
+ .quad 0x3f4a85c40939a85c
+ .quad 0x3f6166f9ac024d16
+ .quad 0x3f6c44e10125e227
+ .quad 0x3f4cebf48bbd90e5
+ .quad 0x3f62492492492492
+ .quad 0x3f6d6f2e2ec0b673
+ .quad 0x3f5159e26af37c05
+ .quad 0x3f64024540245402
+ .quad 0x3f6f6f0243f6f024
+ .quad 0x3f55e60121579805
+ .quad 0x3f668e18cf81b10f
+ .quad 0x3f32012012012012
+ .quad 0x3f5c11f7047dc11f
+ .quad 0x3f69e878ff70985e
+ .quad 0x3f4779d9fdc3a219
+ .quad 0x3f61eace5c957907
+ .quad 0x3f6e0d5b450239e1
+ .quad 0x3f548bf073816367
+ .quad 0x3f6694808dda5202
+ .quad 0x3f37c67f2bae2b21
+ .quad 0x3f5ee58469ee5847
+ .quad 0x3f6c0233c0233c02
+ .quad 0x3f514e02328a7012
+ .quad 0x3f6561072057b573
+ .quad 0x3f31811811811812
+ .quad 0x3f5e28646f5a1060
+ .quad 0x3f6c0d1284e6f1d7
+ .quad 0x3f523543f0c80459
+ .quad 0x3f663cbeea4e1a09
+ .quad 0x3f3b9a3fdd5c8cb8
+ .quad 0x3f60be1c159a76d2
+ .quad 0x3f6e1d1a688e4838
+ .quad 0x3f572044d72044d7
+ .quad 0x3f691713db81577b
+ .quad 0x3f4ac73ae9819b50
+ .quad 0x3f6460334e904cf6
+ .quad 0x3f31111111111111
+ .quad 0x3f5feef80441fef0
+ .quad 0x3f6de021fde021fe
+ .quad 0x3f57b7eacc9686a0
+ .quad 0x3f69ead7cd391fbc
+ .quad 0x3f50195609804390
+ .quad 0x3f6641511e8d2b32
+ .quad 0x3f4222b1acf1ce96
+ .quad 0x3f62e29f79b47582
+ .quad 0x3f24f0d1682e11cd
+ .quad 0x3f5f9bb096771e4d
+ .quad 0x3f6e5ee45dd96ae2
+ .quad 0x3f5a0429a0429a04
+ .quad 0x3f6bb74d5f06c021
+ .quad 0x3f54fce404254fce
+ .quad 0x3f695766eacbc402
+ .quad 0x3f50842108421084
+ .quad 0x3f673e5371d5c338
+ .quad 0x3f4930523fbe3368
+ .quad 0x3f656b38f225f6c4
+ .quad 0x3f426e978d4fdf3b
+ .quad 0x3f63dd40e4eb0cc6
+ .quad 0x3f397f7d73404146
+ .quad 0x3f6293982cc98af1
+ .quad 0x3f30410410410410
+ .quad 0x3f618d6f048ff7e4
+ .quad 0x3f2236a3ebc349de
+ .quad 0x3f60c9f8ee53d18c
+ .quad 0x3f10204081020408
+ .quad 0x3f60486ca2f46ea6
+ .quad 0x3ef0101010101010
+ .quad 0x3f60080402010080
+ .quad 0x0000000000000000
+
+#---------------------
+# exp data
+#---------------------
+
+.align 16
+
+.L__denormal_threshold: .long 0x0fffffc02 # -1022
+ .long 0
+ .quad 0
+
+.L__enable_almost_inf: .quad 0x7fe0000000000000
+ .quad 0
+
+.L__real_zero: .quad 0x0000000000000000
+ .quad 0
+
+.L__real_smallest_denormal: .quad 0x0000000000000001
+ .quad 0
+.L__denormal_tiny_threshold: .quad 0x0c0874046dfefd9d0
+ .quad 0
+
+.L__real_p65536: .quad 0x40f0000000000000 # 65536
+ .quad 0
+.L__real_m68800: .quad 0x0c0f0cc0000000000 # -68800
+ .quad 0
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+ .quad 0
+.L__real_log2_by_64_head: .quad 0x3f862e42f0000000 # log2_by_64_head
+ .quad 0
+.L__real_log2_by_64_tail: .quad 0x0bdfdf473de6af278 # -log2_by_64_tail
+ .quad 0
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+ .quad 0
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+ .quad 0
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+ .quad 0
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+ .quad 0
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+ .quad 0
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
+#endif
diff --git a/src/gas/powf.S b/src/gas/powf.S
new file mode 100644
index 0000000..96eefd2
--- /dev/null
+++ b/src/gas/powf.S
@@ -0,0 +1,1040 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# powf.S
+#
+# An implementation of the powf libm function.
+#
+# Prototype:
+#
+# float powf(float x, float y);
+#
+
+#
+# Algorithm:
+# x^y = e^(y*ln(x))
+#
+# Look in exp, log for the respective algorithms
+#
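+# Only single-precision accuracy is required, so x and y are promoted to
+# double (cvtss2sd below) and the whole computation runs in plain double
+# precision with smaller tables and no head/tail bookkeeping; the structure
+# otherwise mirrors pow.S.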
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(powf)
+#define fname_special _powf_special@PLT
+
+
+# local variable storage offsets
+.equ save_x, 0x0
+.equ save_y, 0x10
+.equ p_temp_exp, 0x20
+.equ negate_result, 0x30
+.equ save_ax, 0x40
+.equ y_head, 0x50
+.equ p_temp_log, 0x60
+.equ stack_size, 0x78
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ movss %xmm0, save_x(%rsp)
+ movss %xmm1, save_y(%rsp)
+
+ mov save_x(%rsp), %edx
+ mov save_y(%rsp), %r8d
+
+ mov .L__f32_exp_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ jz .L__y_is_zero
+
+ cmp .L__f32_pos_one(%rip), %r8d
+ je .L__y_is_one
+
+ mov .L__f32_sign_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_sign_mask(%rip), %r9d
+ mov .L__f32_pos_zero(%rip), %eax
+ mov %eax, negate_result(%rsp)
+ je .L__x_is_neg
+
+ cmp .L__f32_pos_one(%rip), %edx
+ je .L__x_is_pos_one
+
+ cmp .L__f32_pos_zero(%rip), %edx
+ je .L__x_is_zero
+
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_max_bound(%rip), %r10d
+ jg .L__ay_is_very_large
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_min_bound(%rip), %r10d
+ jl .L__ay_is_very_small
+
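+        # A rough C-like summary of the dispatch above (exact NaN/SNaN handling
+        # and the flag values passed to fname_special are omitted; see the
+        # labelled blocks further down):
+        #
+        #     if (y == 0.0f)              return 1.0f;      /* .L__y_is_zero        */
+        #     if (y == 1.0f)              return x;         /* .L__y_is_one         */
+        #     if (signbit(x))             goto x_is_neg;    /* y must be an integer */
+        #     if (x == 1.0f)              return 1.0f;      /* .L__x_is_pos_one     */
+        #     if (x == 0.0f)              goto x_is_zero;
+        #     if (isinf(x) || isnan(x))   goto x_is_inf_or_nan;
+        #     if (|y| is huge)            goto ay_is_very_large;  /* 0, 1 or inf    */
+        #     if (|y| is tiny)            return 1.0f;            /* x^y ~= 1       */
+        #     /* otherwise fall through to the log/exp path */
+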
+ # -----------------------------
+ # compute log(x) here
+ # -----------------------------
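+        # A hedged C sketch of the table-driven log(x) computed below
+        # (simplified: the real code also rounds the table index using the
+        # 8th mantissa bit and keeps extra precision):
+        #
+        #     #include <stdint.h>
+        #     #include <string.h>
+        #     extern const double ln_table[129];   /* assumed: ln_table[j] = ln(1 + j/128) */
+        #     static double log_sketch(double x)   /* x positive, normal */
+        #     {
+        #         uint64_t u;  memcpy(&u, &x, 8);
+        #         int    xexp = (int)(u >> 52) - 1023;            /* unbiased exponent   */
+        #         int    j    = (int)((u >> 45) & 0x7f);          /* top 7 mantissa bits */
+        #         double Y    = 0.5 + (double)(u & 0xfffffffffffffULL) * 0x1p-53;
+        #         double F    = 0.5 + (double)j * 0x1p-8;         /* table break point   */
+        #         double r    = (F - Y) * (1.0 / F);              /* so Y = F*(1 - r)    */
+        #         double poly = r + r*r/2 + r*r*r/3 + r*r*r*r/4;  /* ~ -ln(1 - r)        */
+        #         return xexp * 0.69314718055994531 + ln_table[j] - poly;
+        #     }
+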
+.L__log_x:
+
+ movss save_y(%rsp), %xmm7
+ cvtss2sd %xmm0, %xmm0
+ cvtss2sd %xmm7, %xmm7
+ movsd %xmm7, save_y(%rsp)
+
+ # compute exponent part
+ xor %r8, %r8
+ movdqa %xmm0, %xmm3
+ psrlq $52, %xmm3
+ movd %xmm0, %r8
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+ pand .L__real_mant(%rip), %xmm2
+
+ # compute index into the log tables
+ mov %r8, %r9
+ and .L__mask_mant_all7(%rip), %r8
+ and .L__mask_mant8(%rip), %r9
+ shl %r9
+ add %r9, %r8
+ mov %r8, p_temp_log(%rsp)
+
+ # F, Y
+ movsd p_temp_log(%rsp), %xmm1
+ shr $45, %r8
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%r8,8), %xmm1
+ movsd %xmm1, %xmm2
+
+ lea .L__log_128_table(%rip), %r9
+ movsd .L__real_log2(%rip), %xmm5
+ movsd (%r9,%r8,8), %xmm0
+
+ # poly
+ mulsd %xmm2, %xmm1
+ movsd .L__real_1_over_4(%rip), %xmm4
+ movsd .L__real_1_over_2(%rip), %xmm3
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ addsd .L__real_1_over_3(%rip), %xmm4
+ addsd .L__real_1_over_1(%rip), %xmm3
+ mulsd %xmm1, %xmm4
+ mulsd %xmm2, %xmm3
+ addsd %xmm4, %xmm3
+
+ mulsd %xmm6, %xmm5
+ subsd %xmm3, %xmm0
+ addsd %xmm5, %xmm0
+
+ movsd save_y(%rsp), %xmm7
+ mulsd %xmm7, %xmm0
+
+ # v = y * ln(x)
+ # xmm0 - v
+
+ # -----------------------------
+ # compute exp( y * ln(x) ) here
+ # -----------------------------
+
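+        # A hedged C sketch of the exp(v) evaluation below, where v = y*ln(x)
+        # (the over/underflow checks against .L__real_p4096 / .L__real_m4768
+        # are performed first in the code that follows):
+        #
+        #     #include <math.h>
+        #     extern const double two_to_jby32[32];     /* assumed: the 2^(j/32) table below */
+        #     static double exp_sketch(double v)
+        #     {
+        #         const double ln2 = 0.69314718055994531;
+        #         int    n = (int)lrint(v * 32.0 / ln2);   /* nearest multiple of ln2/32 */
+        #         int    j = n & 31, m = (n - j) / 32;
+        #         double r = v - n * (ln2 / 32.0);
+        #         double q = r + r*r/2 + r*r*r/6 + r*r*r*r/24;     /* ~ e^r - 1 */
+        #         return ldexp(two_to_jby32[j] * (1.0 + q), m);    /* * 2^m     */
+        #     }
+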
+ # x * (32/ln(2))
+ movsd .L__real_32_by_log2(%rip), %xmm7
+ movsd %xmm0, p_temp_exp(%rsp)
+ mulsd %xmm0, %xmm7
+ mov p_temp_exp(%rsp), %rdx
+
+ # v < 128*ln(2), ( v * (32/ln(2)) ) < 32*128
+ # v >= -150*ln(2), ( v * (32/ln(2)) ) >= 32*(-150)
+ comisd .L__real_p4096(%rip), %xmm7
+ jae .L__process_result_inf
+
+ comisd .L__real_m4768(%rip), %xmm7
+ jb .L__process_result_zero
+
+ # n = int( v * (32/ln(2)) )
+ cvtpd2dq %xmm7, %xmm4
+ lea .L__two_to_jby32_table(%rip), %r10
+ cvtdq2pd %xmm4, %xmm1
+
+ # r = x - n * ln(2)/32
+ movsd .L__real_log2_by_32(%rip), %xmm2
+ mulsd %xmm1, %xmm2
+ movd %xmm4, %ecx
+ mov $0x1f, %rax
+ and %ecx, %eax
+ subsd %xmm2, %xmm0
+ movsd %xmm0, %xmm1
+
+ # m = (n - j) / 32
+ sub %eax, %ecx
+ sar $5, %ecx
+
+ # q
+ mulsd %xmm0, %xmm1
+ movsd .L__real_1_by_24(%rip), %xmm4
+ movsd .L__real_1_by_2(%rip), %xmm3
+ mulsd %xmm0, %xmm4
+ mulsd %xmm0, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_by_6(%rip), %xmm4
+ addsd .L__real_1_by_1(%rip), %xmm3
+ mulsd %xmm1, %xmm4
+ mulsd %xmm0, %xmm3
+ addsd %xmm4, %xmm3
+ movsd %xmm3, %xmm0
+
+ add $1023, %rcx
+ shl $52, %rcx
+
+ # (f)*(1+q)
+ movsd (%r10,%rax,8), %xmm1
+ mulsd %xmm1, %xmm0
+ addsd %xmm1, %xmm0
+
+ mov %rcx, p_temp_exp(%rsp)
+ mulsd p_temp_exp(%rsp), %xmm0
+ cvtsd2ss %xmm0, %xmm0
+ orps negate_result(%rsp), %xmm0
+
+.L__final_check:
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__process_result_zero:
+ mov .L__f32_real_zero(%rip), %r11d
+ or negate_result(%rsp), %r11d
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__process_result_inf:
+ mov .L__f32_real_inf(%rip), %r11d
+ or negate_result(%rsp), %r11d
+ jmp .L__z_is_zero_or_inf
+
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_max_bound(%rip), %r10d
+ jg .L__ay_is_very_large
+
+ # determine if y is an integer
+ mov .L__f32_exp_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ mov %r10d, %r11d
+ mov .L__f32_exp_shift(%rip), %ecx
+ shr %cl, %r10d
+ sub .L__f32_exp_bias(%rip), %r10d
+ js .L__x_is_neg_y_is_not_int
+
+ mov .L__f32_exp_mant_mask(%rip), %eax
+ and %edx, %eax
+ mov %eax, save_ax(%rsp)
+
+ cmp .L__yexp_24(%rip), %r10d
+ mov %r10d, %ecx
+ jg .L__continue_after_y_int_check
+
+ mov .L__f32_mant_full(%rip), %r9d
+ shr %cl, %r9d
+ and %r11d, %r9d
+ jnz .L__x_is_neg_y_is_not_int
+
+ mov .L__f32_1_before_mant(%rip), %r9d
+ shr %cl, %r9d
+ and %r11d, %r9d
+ jz .L__continue_after_y_int_check
+
+ mov .L__f32_sign_mask(%rip), %eax
+ mov %eax, negate_result(%rsp)
+
+.L__continue_after_y_int_check:
+
+ cmp .L__f32_neg_zero(%rip), %edx
+ je .L__x_is_zero
+
+ cmp .L__f32_neg_one(%rip), %edx
+ je .L__x_is_neg_one
+
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ movss save_ax(%rsp), %xmm0
+ jmp .L__log_x
+
+.p2align 4,,15
+.L__x_is_pos_one:
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jz .L__final_check
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movss .L__f32_pos_one(%rip), %xmm2
+ mov .L__flag_x_one_y_snan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_zero:
+
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r9d
+ mov .L__f32_pos_one(%rip), %r11d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ cmove %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__x_is_nan
+
+ movss .L__f32_pos_one(%rip), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_one:
+ xor %eax, %eax
+ mov %edx, %r11d
+ mov .L__f32_exp_mask(%rip), %r9d
+ or .L__f32_qnan_set(%rip), %r11d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ cmove %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__x_is_nan
+
+ movd %edx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_one:
+ mov .L__f32_pos_one(%rip), %edx
+ or negate_result(%rsp), %edx
+ xor %eax, %eax
+ mov %r8d, %r11d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r11d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jnz .L__y_is_nan
+
+ movd %edx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_y_is_not_int:
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ cmp .L__f32_neg_zero(%rip), %edx
+ je .L__x_is_zero
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movss .L__f32_qnan(%rip), %xmm2
+ mov .L__flag_x_neg_y_notint(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_large:
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ mov .L__f32_exp_mant_mask(%rip), %r9d
+ and %edx, %r9d
+ jz .L__x_is_zero
+
+ cmp .L__f32_neg_one(%rip), %edx
+ je .L__x_is_neg_one
+
+ mov %edx, %r9d
+ and .L__f32_exp_mant_mask(%rip), %r9d
+ cmp .L__f32_pos_one(%rip), %r9d
+ jl .L__ax_lt1_y_is_large_or_inf_or_nan
+
+ jmp .L__ax_gt1_y_is_large_or_inf_or_nan
+
+.p2align 4,,15
+.L__x_is_zero:
+ mov .L__f32_exp_mask(%rip), %r10d
+ xor %eax, %eax
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ je .L__x_is_zero_y_is_inf_or_nan
+
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovnz .L__f32_pos_inf(%rip), %eax
+ jnz .L__x_is_zero_z_is_inf
+
+ movd %eax, %xmm0
+ orps negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_z_is_inf:
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %eax, %xmm2
+ orps negate_result(%rsp), %xmm2
+ mov .L__flag_x_zero_z_inf(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_y_is_inf_or_nan:
+ mov %r8d, %r11d
+ cmp .L__f32_neg_inf(%rip), %r8d
+ cmove .L__f32_pos_inf(%rip), %eax
+ je .L__x_is_zero_z_is_inf
+
+ or .L__f32_qnan_set(%rip), %r11d
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ jnz .L__y_is_nan
+
+ movd %eax, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovz .L__f32_pos_inf(%rip), %r11d
+ mov %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ or .L__f32_qnan_set(%rip), %eax
+ and %edx, %r9d
+ cmovnz %eax, %r11d
+ jnz .L__x_is_nan
+
+ xor %eax, %eax
+ mov %r8d, %r9d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r9d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ cmovnz %r9d, %r11d
+ jnz .L__y_is_nan
+
+ movd %r11d, %xmm0
+ orps negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_small:
+ movss .L__f32_pos_one(%rip), %xmm0
+ addss %xmm1, %xmm0
+ jmp .L__final_check
+
+
+.p2align 4,,15
+.L__ax_lt1_y_is_large_or_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovnz .L__f32_pos_inf(%rip), %r11d
+ jmp .L__adjust_for_nan
+
+.p2align 4,,15
+.L__ax_gt1_y_is_large_or_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovz .L__f32_pos_inf(%rip), %r11d
+
+.p2align 4,,15
+.L__adjust_for_nan:
+
+ xor %eax, %eax
+ mov %r8d, %r9d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r9d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ cmovnz %r9d, %r11d
+ jnz .L__y_is_nan
+
+ test %eax, %eax
+ jnz .L__y_is_inf
+
+.p2align 4,,15
+.L__z_is_zero_or_inf:
+
+ mov .L__flag_z_zero(%rip), %edi
+ test %r11d, %r11d
+ cmovnz .L__flag_z_inf(%rip), %edi
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_inf:
+
+ movd %r11d, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan:
+
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jnz .L__x_is_nan_y_is_nan
+
+ mov .L__f32_qnan_set(%rip), %r9d
+ and %edx, %r9d
+ movd %r11d, %xmm0
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_x_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_nan:
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ movd %r11d, %xmm0
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan_y_is_nan:
+
+ mov .L__f32_qnan_set(%rip), %r9d
+ and %edx, %r9d
+ jz .L__continue_xy_nan
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ jz .L__continue_xy_nan
+
+ movd %r11d, %xmm0
+ jmp .L__final_check
+
+.L__continue_xy_nan:
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_x_nan_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_one_y_snan: .long 1
+.L__flag_x_zero_z_inf: .long 2
+.L__flag_x_nan: .long 3
+.L__flag_y_nan: .long 4
+.L__flag_x_nan_y_nan: .long 5
+.L__flag_x_neg_y_notint: .long 6
+.L__flag_z_zero: .long 7
+.L__flag_z_denormal: .long 8
+.L__flag_z_inf: .long 9
+
+.align 16
+
+.L__f32_ay_max_bound: .long 0x4f000000
+.L__f32_ay_min_bound: .long 0x2e800000
+.L__f32_sign_mask: .long 0x80000000
+.L__f32_sign_and_exp_mask: .long 0x0ff800000
+.L__f32_exp_mask: .long 0x7f800000
+.L__f32_neg_inf: .long 0x0ff800000
+.L__f32_pos_inf: .long 0x7f800000
+.L__f32_pos_one: .long 0x3f800000
+.L__f32_pos_zero: .long 0x00000000
+.L__f32_exp_mant_mask: .long 0x7fffffff
+.L__f32_mant_mask: .long 0x007fffff
+
+.L__f32_neg_qnan: .long 0x0ffc00000
+.L__f32_qnan: .long 0x7fc00000
+.L__f32_qnan_set: .long 0x00400000
+
+.L__f32_neg_one: .long 0x0bf800000
+.L__f32_neg_zero: .long 0x80000000
+
+.L__f32_real_one: .long 0x3f800000
+.L__f32_real_zero: .long 0x00000000
+.L__f32_real_inf: .long 0x7f800000
+
+.L__yexp_24: .long 0x00000018
+
+.L__f32_exp_shift: .long 0x00000017
+.L__f32_exp_bias: .long 0x0000007f
+.L__f32_mant_full: .long 0x007fffff
+.L__f32_1_before_mant: .long 0x00800000
+
+.align 16
+
+.L__mask_mant_all7: .quad 0x000fe00000000000
+.L__mask_mant8: .quad 0x0000100000000000
+
+#---------------------
+# log data
+#---------------------
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_nan: .quad 0x7ff8000000000000 # NaN
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+
+
+.L__real_log2: .quad 0x3fe62e42fefa39ef
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__real_1_over_1: .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+
+
+.align 16
+.L__log_128_table:
+ .quad 0x0000000000000000
+ .quad 0x3f7fe02a6b106789
+ .quad 0x3f8fc0a8b0fc03e4
+ .quad 0x3f97b91b07d5b11b
+ .quad 0x3f9f829b0e783300
+ .quad 0x3fa39e87b9febd60
+ .quad 0x3fa77458f632dcfc
+ .quad 0x3fab42dd711971bf
+ .quad 0x3faf0a30c01162a6
+ .quad 0x3fb16536eea37ae1
+ .quad 0x3fb341d7961bd1d1
+ .quad 0x3fb51b073f06183f
+ .quad 0x3fb6f0d28ae56b4c
+ .quad 0x3fb8c345d6319b21
+ .quad 0x3fba926d3a4ad563
+ .quad 0x3fbc5e548f5bc743
+ .quad 0x3fbe27076e2af2e6
+ .quad 0x3fbfec9131dbeabb
+ .quad 0x3fc0d77e7cd08e59
+ .quad 0x3fc1b72ad52f67a0
+ .quad 0x3fc29552f81ff523
+ .quad 0x3fc371fc201e8f74
+ .quad 0x3fc44d2b6ccb7d1e
+ .quad 0x3fc526e5e3a1b438
+ .quad 0x3fc5ff3070a793d4
+ .quad 0x3fc6d60fe719d21d
+ .quad 0x3fc7ab890210d909
+ .quad 0x3fc87fa06520c911
+ .quad 0x3fc9525a9cf456b4
+ .quad 0x3fca23bc1fe2b563
+ .quad 0x3fcaf3c94e80bff3
+ .quad 0x3fcbc286742d8cd6
+ .quad 0x3fcc8ff7c79a9a22
+ .quad 0x3fcd5c216b4fbb91
+ .quad 0x3fce27076e2af2e6
+ .quad 0x3fcef0adcbdc5936
+ .quad 0x3fcfb9186d5e3e2b
+ .quad 0x3fd0402594b4d041
+ .quad 0x3fd0a324e27390e3
+ .quad 0x3fd1058bf9ae4ad5
+ .quad 0x3fd1675cababa60e
+ .quad 0x3fd1c898c16999fb
+ .quad 0x3fd22941fbcf7966
+ .quad 0x3fd2895a13de86a3
+ .quad 0x3fd2e8e2bae11d31
+ .quad 0x3fd347dd9a987d55
+ .quad 0x3fd3a64c556945ea
+ .quad 0x3fd404308686a7e4
+ .quad 0x3fd4618bc21c5ec2
+ .quad 0x3fd4be5f957778a1
+ .quad 0x3fd51aad872df82d
+ .quad 0x3fd5767717455a6c
+ .quad 0x3fd5d1bdbf5809ca
+ .quad 0x3fd62c82f2b9c795
+ .quad 0x3fd686c81e9b14af
+ .quad 0x3fd6e08eaa2ba1e4
+ .quad 0x3fd739d7f6bbd007
+ .quad 0x3fd792a55fdd47a2
+ .quad 0x3fd7eaf83b82afc3
+ .quad 0x3fd842d1da1e8b17
+ .quad 0x3fd89a3386c1425b
+ .quad 0x3fd8f11e873662c8
+ .quad 0x3fd947941c2116fb
+ .quad 0x3fd99d958117e08b
+ .quad 0x3fd9f323ecbf984c
+ .quad 0x3fda484090e5bb0a
+ .quad 0x3fda9cec9a9a084a
+ .quad 0x3fdaf1293247786b
+ .quad 0x3fdb44f77bcc8f63
+ .quad 0x3fdb9858969310fb
+ .quad 0x3fdbeb4d9da71b7c
+ .quad 0x3fdc3dd7a7cdad4d
+ .quad 0x3fdc8ff7c79a9a22
+ .quad 0x3fdce1af0b85f3eb
+ .quad 0x3fdd32fe7e00ebd5
+ .quad 0x3fdd83e7258a2f3e
+ .quad 0x3fddd46a04c1c4a1
+ .quad 0x3fde24881a7c6c26
+ .quad 0x3fde744261d68788
+ .quad 0x3fdec399d2468cc0
+ .quad 0x3fdf128f5faf06ed
+ .quad 0x3fdf6123fa7028ac
+ .quad 0x3fdfaf588f78f31f
+ .quad 0x3fdffd2e0857f498
+ .quad 0x3fe02552a5a5d0ff
+ .quad 0x3fe04bdf9da926d2
+ .quad 0x3fe0723e5c1cdf40
+ .quad 0x3fe0986f4f573521
+ .quad 0x3fe0be72e4252a83
+ .quad 0x3fe0e44985d1cc8c
+ .quad 0x3fe109f39e2d4c97
+ .quad 0x3fe12f719593efbc
+ .quad 0x3fe154c3d2f4d5ea
+ .quad 0x3fe179eabbd899a1
+ .quad 0x3fe19ee6b467c96f
+ .quad 0x3fe1c3b81f713c25
+ .quad 0x3fe1e85f5e7040d0
+ .quad 0x3fe20cdcd192ab6e
+ .quad 0x3fe23130d7bebf43
+ .quad 0x3fe2555bce98f7cb
+ .quad 0x3fe2795e1289b11b
+ .quad 0x3fe29d37fec2b08b
+ .quad 0x3fe2c0e9ed448e8c
+ .quad 0x3fe2e47436e40268
+ .quad 0x3fe307d7334f10be
+ .quad 0x3fe32b1339121d71
+ .quad 0x3fe34e289d9ce1d3
+ .quad 0x3fe37117b54747b6
+ .quad 0x3fe393e0d3562a1a
+ .quad 0x3fe3b68449fffc23
+ .quad 0x3fe3d9026a7156fb
+ .quad 0x3fe3fb5b84d16f42
+ .quad 0x3fe41d8fe84672ae
+ .quad 0x3fe43f9fe2f9ce67
+ .quad 0x3fe4618bc21c5ec2
+ .quad 0x3fe48353d1ea88df
+ .quad 0x3fe4a4f85db03ebb
+ .quad 0x3fe4c679afccee3a
+ .quad 0x3fe4e7d811b75bb1
+ .quad 0x3fe50913cc01686b
+ .quad 0x3fe52a2d265bc5ab
+ .quad 0x3fe54b2467999498
+ .quad 0x3fe56bf9d5b3f399
+ .quad 0x3fe58cadb5cd7989
+ .quad 0x3fe5ad404c359f2d
+ .quad 0x3fe5cdb1dc6c1765
+ .quad 0x3fe5ee02a9241675
+ .quad 0x3fe60e32f44788d9
+ .quad 0x3fe62e42fefa39ef
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff999999999999a
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff8618618618618
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff8181818181818
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff6816816816817
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff5555555555555
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff4141414141414
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff1811811811812
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff1111111111111
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff0842108421084
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff0410410410410
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0000000000000
+
+#---------------------
+# exp data
+#---------------------
+
+.align 16
+
+.L__real_zero: .quad 0x0000000000000000
+ .quad 0
+
+.L__real_p4096: .quad 0x40b0000000000000
+ .quad 0
+.L__real_m4768: .quad 0x0c0b2a00000000000
+ .quad 0
+
+.L__real_32_by_log2: .quad 0x40471547652b82fe # 32/ln(2)
+ .quad 0
+.L__real_log2_by_32: .quad 0x3f962e42fefa39ef # log2_by_32
+ .quad 0
+
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+ .quad 0
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+ .quad 0
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+ .quad 0
+.L__real_1_by_1: .quad 0x3ff0000000000000 # 1
+ .quad 0
+
+.align 16
+
+.L__two_to_jby32_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3fff50765b6e4540
+
+
diff --git a/src/gas/remainder.S b/src/gas/remainder.S
new file mode 100644
index 0000000..173da80
--- /dev/null
+++ b/src/gas/remainder.S
@@ -0,0 +1,256 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainder.S
+#
+# An implementation of the remainder libm function.
+#
+# Prototype:
+#
+# double remainder(double x,double y);
+#
+
+#
+# Algorithm:
+#
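+#   Hedged outline of the approach the code below appears to take: when the
+#   exponents of |x| and |y| are both non-zero and differ by less than 52,
+#   compute the remainder directly with SSE arithmetic; otherwise fall back
+#   to the x87 fprem1 loop. A rough C sketch of the direct path, ignoring
+#   special cases:
+#
+#     #include <math.h>
+#     static double remainder_sketch(double x, double y)   /* finite, y != 0 */
+#     {
+#         double dx = fabs(x), w = fabs(y);
+#         double n  = trunc(dx / w);                /* integer quotient, toward zero    */
+#         int  todd = fmod(n, 2.0) != 0.0;          /* quotient parity                  */
+#         dx -= n * w;                              /* the real code forms this product */
+#                                                   /* in extended (head+tail) precision */
+#         if (dx + dx > w || (todd && dx + dx == w))
+#             dx -= w;                              /* round the quotient to nearest/even */
+#         return x < 0.0 ? -dx : dx;
+#     }
+#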
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainder)
+#define fname_special _remainder_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x80
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ movd %xmm0,%r8
+ movd %xmm1,%r9
+ movsd %xmm0,%xmm2
+ movsd %xmm1,%xmm3
+ movsd %xmm0,%xmm4
+ movsd %xmm1,%xmm5
+ mov .L__exp_mask_64(%rip), %r10
+ and %r10,%r8
+ and %r10,%r9
+ xor %r10,%r10
+ ror $52, %r8
+ ror $52, %r9
+ cmp $0,%r8
+ jz .L__LargeExpDiffComputation
+ cmp $0,%r9
+ jz .L__LargeExpDiffComputation
+ sub %r9,%r8 #
+ cmp $52,%r8
+ jge .L__LargeExpDiffComputation
+ pand .L__Nan_64(%rip),%xmm4
+ pand .L__Nan_64(%rip),%xmm5
+ comisd %xmm5,%xmm4
+ jp .L__InputIsNaN # if either of xmm1 or xmm0 is a NaN then
+ # parity flag is set
+ jz .L__Input_Is_Equal
+ jbe .L__ReturnImmediate
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+
+ #calculation without using the x87 FPU
+.L__DirectComputation:
+ movapd %xmm4,%xmm2
+ movapd %xmm5,%xmm3
+ divsd %xmm3,%xmm2
+ cvttsd2siq %xmm2,%r8
+ mov %r8,%r10
+ and $0X01,%r10
+ cvtsi2sdq %r8,%xmm2
+
+ #multiplication in QUAD Precision
+        #Since a plain double-precision multiplication here lost too much accuracy,
+        #we implement the product in quad-like (double-double) precision.
+        #Logic behind the quad-precision multiplication:
+ #x = hx + tx by setting x's last 27 bits to null
+ #y = hy + ty similar to x
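+        # A hedged C sketch of this split multiplication (a Dekker-style
+        # two_prod; the 27-bit mask is .L__27bit_andingmask_64 below):
+        #
+        #     #include <stdint.h>
+        #     #include <string.h>
+        #     static void two_prod(double x, double y, double *z, double *zz)
+        #     {
+        #         const uint64_t m = 0xfffffffff8000000ULL;   /* clear low 27 mantissa bits */
+        #         uint64_t ux, uy;  double hx, tx, hy, ty;
+        #         memcpy(&ux, &x, 8); ux &= m; memcpy(&hx, &ux, 8); tx = x - hx;
+        #         memcpy(&uy, &y, 8); uy &= m; memcpy(&hy, &uy, 8); ty = y - hy;
+        #         *z  = x * y;                                        /* rounded product */
+        #         *zz = (((hx*hy - *z) + hx*ty) + tx*hy) + tx*ty;     /* rounding error  */
+        #     }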
+ movapd .L__27bit_andingmask_64(%rip),%xmm4
+ movapd %xmm5,%xmm1 # x
+ movapd %xmm2,%xmm6 # y
+ movapd %xmm2,%xmm7 # z = xmm7
+ mulpd %xmm5,%xmm7 # z = x*y
+ andpd %xmm4,%xmm1
+ andpd %xmm4,%xmm2
+ subsd %xmm1,%xmm5 # xmm1 = hx xmm5 = tx
+ subsd %xmm2,%xmm6 # xmm2 = hy xmm6 = ty
+
+ movapd %xmm1,%xmm4 # copy hx
+ mulsd %xmm2,%xmm4 # xmm4 = hx*hy
+ subsd %xmm7,%xmm4 # xmm4 = (hx*hy - z)
+ mulsd %xmm6,%xmm1 # xmm1 = hx * ty
+ addsd %xmm1,%xmm4 # xmm4 = ((hx * hy - *z) + hx * ty)
+ mulsd %xmm5,%xmm2 # xmm2 = tx * hy
+ addsd %xmm2,%xmm4 # xmm4 = (((hx * hy - *z) + hx * ty) + tx * hy)
+ mulsd %xmm5,%xmm6 # xmm6 = tx * ty
+ addsd %xmm4,%xmm6 # xmm6 = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
+ #xmm6 and xmm7 contain the quad precision result
+ #v = dx - c;
+ movapd %xmm0,%xmm1 # copy the input number
+ pand .L__Nan_64(%rip),%xmm1
+ movapd %xmm1,%xmm2 # xmm2 = dx = xmm1
+ subsd %xmm7,%xmm1 # v = dx - c
+ subsd %xmm1,%xmm2 # (dx - v)
+ subsd %xmm7,%xmm2 # ((dx - v) - c)
+ subsd %xmm6,%xmm2 # (((dx - v) - c) - cc)
+ addsd %xmm1,%xmm2 # xmm2 = dx = v + (((dx - v) - c) - cc)
+ # xmm3 = w
+ movapd %xmm2,%xmm4
+ movapd %xmm3,%xmm5
+ addsd %xmm4,%xmm4 # xmm4 = dx + dx
+ comisd %xmm4,%xmm3 # if (dx + dx > w)
+ jb .L__Substractw
+ mulpd .L__ZeroPointFive(%rip),%xmm5 # xmm5 = 0.5 * w
+ comisd %xmm2,%xmm5 # if (dx > 0.5 * w)
+ jb .L__Substractw
+ cmp $0x01,%r10 # If the quotient is an odd number
+ jnz .L__Finish
+ comisd %xmm4,%xmm3 #if (todd && (dx + dx == w)) then subtract w
+ jz .L__Substractw
+ comisd %xmm0,%xmm5 #if (todd && (dx == 0.5 * w)) then subtract w
+ jnz .L__Finish
+
+.L__Substractw:
+ subsd %xmm3,%xmm2 # dx -= w
+
+# The following code checks the sign of the input number and then calculates the return value
+# return x < 0.0? -dx : dx;
+.L__Finish:
+ comisd .L__Zero_64(%rip), %xmm0
+ ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ ret
+.L__Not_Negative_Number1:
+ movapd %xmm2,%xmm0
+ ret
+
+
+        #calculation using the x87 FPU
+        #Used when the exponent of either the divisor or the
+        #dividend is 0, or when the exponent difference is
+        #greater than 52
+.align 16
+.L__LargeExpDiffComputation:
+ sub $stack_size, %rsp
+ movsd %xmm0, temp_x(%rsp)
+ movsd %xmm1, temp_y(%rsp)
+ ffree %st(0)
+ ffree %st(1)
+ fldl temp_y(%rsp)
+ fldl temp_x(%rsp)
+ fnclex
+.align 16
+.L__repeat:
+ fprem1 #Calculate remainder by dividing st(0) with st(1)
+ #fprem operation sets x87 condition codes,
+ #it will set the C2 code to 1 if a partial remainder is calculated
+        fnstsw  %ax              # store the FPU status word into %ax
+        and     $0x0400,%ax      # keep only the C2 condition-code bit
+                                 # we need to check only the C2 bit of the condition codes
+        cmp     $0x0400,%ax      # check whether bit 10 (C2) is set
+                                 # if it is set, only a partial remainder was calculated
+ jz .L__repeat
+ #store the result from the FPU stack to memory
+ fstpl temp_x(%rsp)
+ fstpl temp_y(%rsp)
+ movsd temp_x(%rsp), %xmm0
+ add $stack_size, %rsp
+ ret
+
+ #IF both the inputs are equal
+.L__Input_Is_Equal:
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r9
+ jz .L__InputIsNaN
+ movsd %xmm0,%xmm1
+ pand .L__sign_mask_64(%rip),%xmm1
+ movsd .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaN:
+ por .L__QNaN_mask_64(%rip),%xmm0
+ por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ movapd %xmm5,%xmm7
+ mulpd .L__ZeroPointFive(%rip),%xmm5 #
+ comisd %xmm4,%xmm5
+ jae .L__FoundResult1
+ subsd %xmm7,%xmm4
+ comisd .L__Zero_64(%rip),%xmm0
+ ja .L__Not_Negative_Number
+.L__Negative_Number:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm4,%xmm0
+ ret
+
+.L__Not_Negative_Number:
+ movapd %xmm4,%xmm0
+ ret
+.align 16
+.L__FoundResult1:
+ ret
+
+
+
+.align 32
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__QNaN_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__Nan_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__ZeroPointFive: .quad 0X3FE0000000000000
+ .quad 0
+
diff --git a/src/gas/remainderf.S b/src/gas/remainderf.S
new file mode 100644
index 0000000..d196d11
--- /dev/null
+++ b/src/gas/remainderf.S
@@ -0,0 +1,221 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainderf.S
+#
+# An implementation of the remainderf libm function.
+#
+# Prototype:
+#
+# float remainderf(float x,float y);
+#
+
+#
+# Algorithm:
+#
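+#   Hedged C sketch of the approach the code below appears to take: work in
+#   double, strip the signs, and peel the quotient off 24 bits at a time with
+#   a scaled divisor (NaN, Inf, zero-divisor and equal-input cases are
+#   handled separately):
+#
+#     #include <math.h>
+#     static float remainderf_sketch(float x, float y)
+#     {
+#         double dx = fabs((double)x), dy = fabs((double)y);
+#         int diff   = ilogb(dx) - ilogb(dy);
+#         int ntimes = diff > 0 ? diff / 24 : 0;
+#         double w = scalbn(dy, 24 * ntimes);           /* dy * 2^(24*ntimes)    */
+#         while (ntimes-- > 0) {                        /* peel 24 quotient bits */
+#             dx -= w * trunc(dx / w);
+#             w  *= 0x1p-24;
+#         }
+#         double t = trunc(dx / w);
+#         int todd = fmod(t, 2.0) != 0.0;
+#         dx -= w * t;
+#         if (dx > 0.5 * w || (todd && dx == 0.5 * w))
+#             dx -= w;                                  /* round quotient to nearest/even */
+#         return (float)(x < 0.0f ? -dx : dx);
+#     }
+#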
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainderf)
+#define fname_special _remainderf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %rdi
+ movapd .L__sign_mask_64(%rip),%xmm6
+ cvtss2sd %xmm0,%xmm2 # double x
+ cvtss2sd %xmm1,%xmm3 # double y
+ pand %xmm6,%xmm2
+ pand %xmm6,%xmm3
+ movd %xmm2,%rax
+ movd %xmm3,%r8
+ mov %rax,%r11
+ mov %r8,%r9
+ movsd %xmm2,%xmm4
+ #take the exponents of both x and y
+ and %rdi,%rax
+ and %rdi,%r8
+ ror $52, %rax
+ ror $52, %r8
+        #if either of the exponents is all ones (infinity or NaN)
+ cmp $0X7FF,%rax
+ jz .L__InputIsNaN
+ cmp $0X7FF,%r8
+ jz .L__InputIsNaNOrInf
+
+ cmp $0,%r8
+ jz .L__Divisor_Is_Zero
+
+ cmp %r9, %r11
+ jz .L__Input_Is_Equal
+ jb .L__ReturnImmediate
+
+ xor %rcx,%rcx
+ mov $24,%rdx
+ movsd .L__One_64(%rip),%xmm7 # xmm7 = scale
+ cmp %rax,%r8
+ jae .L__y_is_greater
+ #xmm3 = dy
+ sub %r8,%rax
+ div %dl # al = ntimes
+ mov %al,%cl # cl = ntimes
+        and     $0xFF,%ax    # set everything to zero except al
+ mul %dl # ax = dl * al = 24* ntimes
+ add $1023, %rax
+ shl $52,%rax
+ movd %rax,%xmm7 # xmm7 = scale
+.L__y_is_greater:
+ mulsd %xmm3,%xmm7 # xmm7 = scale * dy
+ movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+ dec %cl
+ js .L__End_Loop
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ mulsd %xmm6,%xmm7 # w*= scale
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm2,%xmm4 # xmm4 = dx
+ jmp .L__Start_Loop
+.L__End_Loop:
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ and $0x01,%rax # todd = todd = ((int)(dx / w)) & 1
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm7,%xmm6 # store w
+ mulsd .L__Zero_Point_Five64(%rip),%xmm7 #xmm7 = 0.5*w
+
+ cmp $0x01,%rax
+ jnz .L__todd_is_even
+ comisd %xmm2,%xmm7
+ je .L__Subtract_w
+
+.L__todd_is_even:
+ comisd %xmm2,%xmm7
+ jnb .L__Dont_Subtract_w
+
+.L__Subtract_w:
+ subsd %xmm6,%xmm2
+
+.L__Dont_Subtract_w:
+ comiss .L__Zero_64(%rip),%xmm0
+ jb .L__Negative
+ cvtsd2ss %xmm2,%xmm0
+ ret
+.L__Negative:
+ movsd .L__MinusZero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.align 16
+.L__Input_Is_Equal:
+ cmp $0x7FF,%rax
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r8
+ jz .L__InputIsNaNOrInf
+ movsd %xmm0,%xmm1
+ pand .L__sign_bit_32(%rip),%xmm1
+ movss .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaNOrInf:
+ comiss %xmm0,%xmm1
+ jp .L__InputIsNaN
+ ret
+.L__Divisor_Is_Zero:
+.L__InputIsNaN:
+ por .L__exp_mask_32(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ por .L__QNaN_mask_32(%rip),%xmm0
+ ret
+
+#Case when x < y
+ #xmm2 = dx
+.L__ReturnImmediate:
+ movsd %xmm3,%xmm5
+ mulsd .L__Zero_Point_Five64(%rip), %xmm3 # xmm3 = 0.5*dy
+ comisd %xmm3,%xmm2 # if (dx > 0.5*dy)
+ jna .L__Finish_Immediate # xmm2 <= xmm3
+ subsd %xmm5,%xmm2 #dx -= dy
+
+.L__Finish_Immediate:
+ comiss .L__Zero_64(%rip),%xmm0
+ #xmm0 contains the input and is the result
+ jz .L__Zero
+ ja .L__Positive
+
+ movsd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.L__Zero:
+ ret
+
+.L__Positive:
+ cvtsd2ss %xmm2,%xmm0
+ ret
+
+
+
+.align 32
+.L__sign_bit_32: .quad 0x8000000080000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__exp_mask_32: .quad 0x000000007F800000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__One_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__MinusZero_64: .quad 0x8000000000000000
+ .quad 0
+.L__QNaN_mask_32: .quad 0x0000000000400000
+ .quad 0
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2pminus24_decimal: .quad 0x3E70000000000000
+ .quad 0
+.L__Zero_Point_Five64: .quad 0x3FE0000000000000
+ .quad 0
+
diff --git a/src/gas/round.S b/src/gas/round.S
new file mode 100644
index 0000000..c1ac20a
--- /dev/null
+++ b/src/gas/round.S
@@ -0,0 +1,151 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# round.S
+#
+# An implementation of the round libm function.
+#
+# Prototype:
+#
+# double round(double x);
+#
+
+#
+# Algorithm: First get the exponent of the input
+# double precision number.
+# IF exponent is greater than 51 then return the
+# input as is.
+# IF exponent is less than 0 then force the fraction
+# to be rounded away by adding a large constant
+# (2^52 + 1) and then subtracting the same constant.
+# OTHERWISE (exponent 0 to 51) add 0.5 and
+# shift the mantissa bits based on the exponent
+# value to discard the fractional component.
+#
+#
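+#   A hedged C sketch of the same idea (simplified; the assembly keeps the
+#   pieces in XMM registers instead of reassembling through memory):
+#
+#     #include <math.h>
+#     #include <stdint.h>
+#     #include <string.h>
+#     static double round_sketch(double x)                 /* finite x */
+#     {
+#         uint64_t u;  memcpy(&u, &x, 8);
+#         uint64_t sign = u & 0x8000000000000000ULL;
+#         int e = (int)((u >> 52) & 0x7ff) - 1023;
+#         if (e > 51) return x;                            /* already an integer     */
+#         if (e < 0) {                                     /* |x| < 1: result 0 or 1 */
+#             double t = fabs(x) + 0x1.0000000000001p52;   /* 2^52+1 forces rounding */
+#             t -= 0x1.0000000000001p52;
+#             memcpy(&u, &t, 8); u |= sign; memcpy(&t, &u, 8);
+#             return t;
+#         }
+#         double t = fabs(x) + 0.5;                        /* round half away from zero */
+#         memcpy(&u, &t, 8);
+#         int et = (int)((u >> 52) & 0x7ff) - 1023;        /* exponent of |x| + 0.5     */
+#         u &= ~(0x000fffffffffffffULL >> et);             /* clear the fraction bits   */
+#         u |= sign;
+#         memcpy(&t, &u, 8);
+#         return t;
+#     }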
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(round)
+#define fname_special _round_special
+
+
+# local variable storage offsets
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+#(SSE4.1 provides roundss/roundsd instructions that could do this directly)
+fname:
+ movsd .L__2p52_plus_one(%rip),%xmm4
+ movsd .L__sign_mask_64(%rip),%xmm5
+ mov $52,%r10
+ #take 3 copies of the input xmm0
+ movsd %xmm0,%xmm1
+ movsd %xmm0,%xmm2
+ movsd %xmm0,%xmm3
+        #get the most significant half word of the input number into r9
+ pand .L__exp_mask_64(%rip), %xmm1
+ pextrw $3,%xmm1,%r9
+ cmp $0X7FF0,%r9
+ #Check for infinity inputs
+ jz .L__is_infinity
+ movsd .L__sign_mask_64(%rip), %xmm1
+ pandn %xmm2,%xmm1 # xmm1 now stores the sign of the input number
+        #After shifting r9 right by 4 and subtracting the bias 0x3FF,
+        #r9 holds the unbiased exponent.
+ shr $0X4,%r9
+ sub $0x3FF,%r9
+ cmp $0x00, %r9
+ jl .L__number_less_than_zero
+
+ #IF exponent is greater than 0
+.L__number_greater_than_zero:
+ cmp $51,%r9
+ jg .L__is_greater_than_2p52
+
+ #IF exponent is greater than 0 and less than 2^52
+ pand .L__sign_mask_64(%rip),%xmm0
+ #add with 0.5
+ addsd .L__zero_point_5(%rip),%xmm0
+ movsd %xmm0,%xmm5
+
+ pand .L__exp_mask_64(%rip),%xmm5
+ pand .L__mantissa_mask_64(%rip),%xmm0
+        #r10 = 52 (mantissa length) - r9 (input exponent)
+ sub %r9,%r10
+ movd %r10, %xmm2
+        #do right then left shift by (mantissa length - input exp) to clear the fraction bits
+ psrlq %xmm2,%xmm0
+ psllq %xmm2,%xmm0
+ #OR the input exponent with the input sign
+ por %xmm1,%xmm5
+        #finally OR with the mantissa
+ por %xmm5,%xmm0
+ ret
+
+ #IF exponent is less than 0
+.L__number_less_than_zero:
+ pand %xmm5,%xmm3 # xmm3 =abs(input)
+        addsd   %xmm4,%xmm3     # add (2^52 + 1)
+        subsd   %xmm4,%xmm3     # sub (2^52 + 1)
+ por %xmm1, %xmm3 # OR with the sign of the input number
+ movsd %xmm3,%xmm0
+ ret
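+
+        # Worked example of the trick above (round-to-nearest-even arithmetic;
+        # the +1 makes the base odd, so the 0.5 tie rounds upward as round() requires):
+        #   |x| = 0.3 : 0.3 + (2^52+1) rounds to 2^52+1; subtracting gives 0.0
+        #   |x| = 0.5 : the sum is halfway and ties to the even value 2^52+2, giving 1.0
+        #   |x| = 0.7 : 0.7 + (2^52+1) rounds to 2^52+2; subtracting gives 1.0
+        # The input sign is OR-ed back in afterwards, so round(-0.3) = -0.0.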
+
+ #IF the input is infinity
+.L__is_infinity:
+ comisd %xmm4,%xmm0
+        jnp     .L__is_zero      # PF clear => input is +/-Inf, return it as is
+        #IF the input is a NaN
+.L__is_nan :
+ por .L__qnan_mask_64(%rip),%xmm0 # set the QNan Bit
+.L__is_zero :
+.L__is_greater_than_2p52:
+ ret
+
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+
+.L__qnan_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+ .quad 0
+.L__zero: .quad 0x0000000000000000
+ .quad 0
+.L__2p52_plus_one: .quad 0x4330000000000001 # = 4503599627370497.0
+ .quad 0
+.L__zero_point_5:      .quad 0x3FE0000000000001        # = 0.5 (plus one ulp)
+ .quad 0
+
+
+
diff --git a/src/gas/sin.S b/src/gas/sin.S
new file mode 100644
index 0000000..378e103
--- /dev/null
+++ b/src/gas/sin.S
@@ -0,0 +1,481 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sin function.
+#
+# Prototype:
+#
+# double sin(double x);
+#
+# Computes sin(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_411E848000000000:     .quad 0x415312d000000000        # 5e6 (label retains the old 5e5 constant 0x411E848000000000)
+ .quad 0
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff # Sign bit zero
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+
+.align 32
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0
+ .quad 0x03EFA01A019F4EC91 # 2.48016e-005 c3
+ .quad 0
+ .quad 0x0bE927E4FA17F667B # -2.75573e-007 c4
+ .quad 0
+ .quad 0x03E21EEB690382EEC # 2.08761e-009 c5
+ .quad 0
+ .quad 0x0bDA907DB47258AA7 # -1.13826e-011 c6
+ .quad 0
+
+.align 32
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0
+
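+# Hedged C sketch of the polynomial kernels these tables feed, for a reduced
+# argument r in roughly [-pi/4, pi/4] (the assembly additionally carries an
+# extra-precision tail rr that this sketch ignores):
+#
+#     /* s1..s6 and c1..c6 stand for the .Lsinarray / .Lcosarray constants above */
+#     extern const double s1, s2, s3, s4, s5, s6, c1, c2, c3, c4, c5, c6;
+#     static double sin_kernel(double r)
+#     {
+#         double x2 = r * r;
+#         double zs = s1 + x2*(s2 + x2*(s3 + x2*(s4 + x2*(s5 + x2*s6))));
+#         return r + r*x2*zs;                    /* r + r^3 * poly(r^2)    */
+#     }
+#     static double cos_kernel(double r)
+#     {
+#         double x2 = r * r;
+#         double zc = c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*c6))));
+#         return (1.0 - 0.5*x2) + x2*x2*zc;      /* 1 - r^2/2 + r^4 * poly */
+#     }
+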
+.text
+.align 32
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sin)
+#define fname_special _sin_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0, p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000, %rax
+ mov %rax, %r10
+ and %rdx, %r10
+ cmp %rax, %r10
+ jz .Lsin_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lsin_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsin_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsin_smaller
+
+# sin = x;  (xmm0 already holds x)
+ jmp .Lsin_cleanup
+
+.align 32
+.Lsin_smaller:
+# sin = x - x^3 * 0.1666666666666666666;
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fc5555555555555(%rip), %xmm4 # 0.1666666666666666666
+ mulsd %xmm2, %xmm2 # x^2
+ mulsd %xmm0, %xmm2 # x^3
+ mulsd %xmm4, %xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2, %xmm0 # x - x^3 * 0.1666666666666666666
+ jmp .Lsin_cleanup
+
+.align 32
+.Lsin_small:
+# sin = sin_piby4(x, 0.0);
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+.Lsin_piby4_noreduce:
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sin calculation
+# zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6))));
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ addsd %xmm4, %xmm0 # +x
+ jmp .Lsin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsin_reduce:
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+ mov $0, %r11d
+
+## if (xneg) x = -x;
+ jz .Lpositive
+ mov $1, %r11d
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lsin_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
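+# A hedged C sketch of this Cody-Waite style reduction (x is non-negative
+# here; the extra-precision correction applied when expdiff > 15 is omitted):
+#
+#     extern const double piby2_1, piby2_1tail;          /* the .quad constants above */
+#     const double twobypi = 6.36619772367581382433e-01; /* 2/pi                      */
+#     int    npi2  = (int)(x * twobypi + 0.5);           /* nearest multiple of pi/2  */
+#     double rhead = x - npi2 * piby2_1;                 /* head of x - npi2*(pi/2)   */
+#     double rtail = npi2 * piby2_1tail;
+#     double r     = rhead - rtail;                      /* reduced arg, ~[-pi/4,pi/4] */
+#     double rr    = (rhead - r) - rtail;                /* low-order tail of r        */
+#     int    region = npi2 & 3;                          /* quadrant: picks sin/cos and sign */
+#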
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to float.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexplediff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexplediff15:
+# region = npi2 & 3;
+
+ subsd %xmm0, %xmm4 # rhead-r
+ subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2 after
+# reduction,
+# then the sin is ~ 1.0 , to within 53 bits, when r is < 2^-27. We already
+# have x at this point, so we can skip the sin polynomials.
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .Lsin_piby4 # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+        jle     .Lr_small                       # then sin(r) ~= r and cos(r) ~= 1
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+
+## if region is 0 or 2 do a sin calc.
+ and %eax, %r8d
+ jnz .Lcossmall
+
+# region 0 or 2 do a sin calculation
+# use simply polynomial
+# x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 16
+.Lcossmall:
+# region 1 or 3 do a cos calculation
+# use simply polynomial
+# 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0 # xc
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 1 or 3 do a cos calc.
+ and %eax, %r8d
+ jz .Ladjust_region
+
+# odd
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+.Lsin_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+#    __amd_remainder_piby2(x, &r, &rr, &region);
+
+ mov %r11,p_temp(%rsp)
+ lea region(%rsp), %rdx
+ lea rr(%rsp), %rsi
+ lea r(%rsp), %rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp), %r11
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # x
+ movsd rr(%rsp), %xmm4 # xx
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+# perform taylor series to calc sin(x) or cos(x) of the reduced argument
+.Lsin_piby4:
+# x2 = r * r;
+
+#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path
+#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path
+ movsd %xmm0, %xmm3
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+## if region is 0 or 2 do a sin calc.
+ and %eax, %r8d
+ jnz .Lcosregion
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 do a sin calculation
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm4,p_temp(%rsp) # store xx
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2
+ movsd p_temp(%rsp), %xmm0 # load xx
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ mulsd %xmm0, %xmm2 # 0.5 * x2 *xx
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx
+ addsd %xmm4, %xmm0 # +xx
+ addsd p_temp1(%rsp), %xmm0 # +x
+ jmp .Ladjust_region
+
+.align 16
+.Lcosregion:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+ mulsd %xmm0, %xmm4 # x*xx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5
+ movsd .Lcosarray+0x50(%rip), %xmm1 # c6
+ movsd .Lcosarray+0x20(%rip), %xmm0 # c3
+ mulsd %xmm2, %xmm5 # r = 0.5 *x2
+ movsd %xmm2, %xmm3 # copy of x2
+ movsd %xmm4,p_temp(%rsp) # store x*xx
+ mulsd %xmm2, %xmm1 # c6*x2
+ mulsd %xmm2, %xmm0 # c3*x2
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r
+ mulsd %xmm2, %xmm3 # x4
+ addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6
+ addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3
+ addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t
+ mulsd %xmm2, %xmm3 # x6
+ mulsd %xmm2, %xmm1 # x2(c5+x2c6)
+ mulsd %xmm2, %xmm0 # x2(c2+x2C3)
+ movsd %xmm2, %xmm4 # copy of x2
+ mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate
+ addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6)
+ addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3)
+ mulsd %xmm2, %xmm2 # x4 recalculate
+ subsd %xmm4, %xmm5 # (1 + (-t)) - r
+ mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6))
+ addsd %xmm1, %xmm0 # zc
+        subsd   .L__real_3ff0000000000000(%rip), %xmm4          # -t recalculated (r - 1.0)
+ subsd p_temp(%rsp), %xmm5 # ((1 + (-t)) - r) - x*xx
+ mulsd %xmm2, %xmm0 # x4 * zc
+ addsd %xmm5, %xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subsd %xmm4, %xmm0 # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Ladjust_region: # positive or negative
+# switch (region)
+ shr $1, %eax
+ mov %eax, %ecx
+ and %r11d, %eax
+ not %ecx
+ not %r11d
+ and %r11d, %ecx
+ or %ecx, %eax
+ and $1, %eax
+ jnz .Lsin_cleanup
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 16
+.Lsin_cleanup:
+ add $stack_size, %rsp
+ ret
+
+.align 16
+.Lsin_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
+
diff --git a/src/gas/sincos.S b/src/gas/sincos.S
new file mode 100644
index 0000000..6558f9e
--- /dev/null
+++ b/src/gas/sincos.S
@@ -0,0 +1,616 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincos function.
+#
+# Prototype:
+#
+# void sincos(double x, double* sinr, double* cosr);
+#
+# Computes sin(x) and cos(x), returned through sinr and cosr.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_411E848000000000:     .quad 0x415312d000000000        # 5e6 (label retains the old 5e5 constant 0x411E848000000000)
+ .quad 0
+
+.align 16
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.16666666666666666 s1
+ .quad 0x03fa5555555555555 # 0.041666666666666664 c1
+ .quad 0x03f81111111110bb3 # 0.00833333333333095 s2
+ .quad 0x0bf56c16c16c16967 # -0.0013888888888887398 c2
+ .quad 0x0bf2a01a019e83e5c # -0.00019841269836761127 s3
+ .quad 0x03efa01a019f4ec90 # 2.4801587298767041E-05 c3
+ .quad 0x03ec71de3796cde01 # 2.7557316103728802E-06 s4
+ .quad 0x0be927e4fa17f65f6 # -2.7557317272344188E-07 c4
+ .quad 0x0be5ae600b42fdfa7 # -2.5051132068021698E-08 s5
+        .quad   0x03e21eeb69037ab78             # 2.0876146382232963E-09 c5
+        .quad   0x03de5e0b2f9a43bb8             # 1.5918144304485914E-10 s6
+        .quad   0x0bda907db46cc5e42             # -1.1382639806794487E-11 c6
+
+.align 16
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+        .quad   0x0bf56c16c16c16967             # -0.00138889 c2
+        .quad   0x03f81111111110bb3             # 0.00833333 s2
+        .quad   0x03efa01a019f4ec90             # 2.48016e-005 c3
+        .quad   0x0bf2a01a019e83e5c             # -0.000198413 s3
+        .quad   0x0be927e4fa17f65f6             # -2.75573e-007 c4
+        .quad   0x03ec71de3796cde01             # 2.75573e-006 s4
+        .quad   0x03e21eeb69037ab78             # 2.08761e-009 c5
+        .quad   0x0be5ae600b42fdfa7             # -2.50511e-008 s5
+        .quad   0x0bda907db46cc5e42             # -1.13826e-011 c6
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
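+# The two tables above interleave the sin and cos coefficients so that a
+# single packed Horner pass (mulpd/addpd on {sin, cos} pairs) evaluates both
+# polynomials at once. A hedged scalar C sketch of the idea:
+#
+#     extern const double coeff[12];       /* assumed: the interleaved table above */
+#     double x2 = r * r, z[2];
+#     for (int k = 0; k < 2; ++k) {        /* k = 0: sin stream, k = 1: cos stream */
+#         const double *c = &coeff[k];
+#         z[k] = c[10];                                /* s6 / c6                 */
+#         for (int i = 8; i >= 0; i -= 2)
+#             z[k] = z[k] * x2 + c[i];                 /* Horner step             */
+#     }
+#     double sinr = r + r*x2*z[0];                     /* r + r^3 * zs            */
+#     double cosr = (1.0 - 0.5*x2) + x2*x2*z[1];       /* 1 - r^2/2 + r^4 * zc    */
+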
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincos)
+#define fname_special _sincos_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2,%xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp),%rcx # rcx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000,%rax
+ mov %rax,%r10
+ and %rcx,%r10
+ cmp %rax,%r10
+ jz .Lsincos_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff,%r10
+ and %rcx,%r10 # r10 is ax
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18,%rax
+ cmp %rax,%r10
+ jg .Lsincos_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000,%rax
+ cmp %rax,%r10
+ jge .Lsincos_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000,%rax
+ cmp %rax,%r10
+ jge .Lsincos_smaller
+
+ # sin = x;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos = 1.0;
+ jmp .Lsincos_cleanup
+
+## else
+.align 32
+.Lsincos_smaller:
+# sin = x - x^3 * 0.1666666666666666666;
+# cos = 1.0 - x*x*0.5;
+
+ movsd %xmm0,%xmm2
+ movsd .L__real_3fc5555555555555(%rip),%xmm4 # 0.1666666666666666666
+ mulsd %xmm2,%xmm2 # x^2
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ movsd %xmm2,%xmm3 # copy of x^2
+
+ mulsd %xmm0,%xmm2 # x^3
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 * x^2
+ mulsd %xmm4,%xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2,%xmm0 # x - x^3 * 0.1666666666666666666, sin
+ subsd %xmm3,%xmm1 # 1 - 0.5 * x^2, cos
+
+ jmp .Lsincos_cleanup
+
+
+## else
+
+.align 16
+.Lsincos_small:
+# sin = sin_piby4(x, 0.0);
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+
+# x2 = r * r;
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 # x2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sin calculation
+# zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6))));
+
+ movlhps %xmm2,%xmm2
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movapd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos
+ # xmm2 contains x2 for x3 for sin
+ addpd %xmm5,%xmm3 # zs in lower and zc upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for sin
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+ mulsd %xmm2,%xmm3 # sin *x3
+ mulsd %xmm1,%xmm5 # cos *x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ subsd %xmm2,%xmm1 # 1 - t
+ subsd %xmm4,%xmm1 # (1-t) -r
+ addsd %xmm5,%xmm1 # ((1-t) -r) + cos
+ addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term
+ addsd %xmm2,%xmm1 # xmm1 = t +{ ((1-t) -r) + cos}, final cos term
+
+ jmp .Lsincos_cleanup
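+
+# For reference, a C sketch of the tiered small-argument handling above
+# (a reading of the three code paths, not part of the original source;
+# sin_piby4/cos_piby4 are the polynomial evaluations named in the comments):
+#
+#   if (fabs(x) < 0x1.0p-27) {        /* sin(x) ~ x, cos(x) ~ 1        */
+#       sin = x;  cos = 1.0;
+#   } else if (fabs(x) < 0x1.0p-13) { /* one correction term suffices  */
+#       sin = x - x*x*x * 0.166666666666666666;
+#       cos = 1.0 - x*x * 0.5;
+#   } else {                          /* 2^-13 <= |x| <= pi/4          */
+#       sin = sin_piby4(x, 0.0);
+#       cos = cos_piby4(x, 0.0);
+#   }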
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos_reduce:
+# change rdx to rcx and r8 to r9
+# rcx= ux, r10 = ax
+# %r9,%rax are free
+
+# xneg = (ax != ux);
+ cmp %r10,%rcx
+ mov $0,%r11d
+
+## if (xneg) x = -x;
+ jz .LPositive
+ mov $1,%r11d
+ subsd %xmm0,%xmm2
+ movsd %xmm2,%xmm0
+
+# rcx= ux, r10 = ax, r11= Sign
+# %r9,%rax are free
+# change rdx to rcx and r8 to r9
+
+.align 16
+.LPositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip),%r10
+ jae .Lsincos_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0,%xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # twobypi
+ movsd %xmm0,%xmm4
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ mulsd %xmm3,%xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ shr $52,%r10 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5,%xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttpd2dq %xmm2,%xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip),%xmm1 # piby2_1tail
+ cvtdq2pd %xmm0,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2,%xmm1
+ movd %xmm0,%eax
+
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm5 # piby2_2tail
+ mov %eax,%ecx
+ mov p_temp(%rsp),%r9 # r9 is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1,%r9 # strip any sign bit
+ shr $53,%r9 # >> EXPSHIFTBITS_DP64 +1
+ sub %r9,%r10 # expdiff
+
+## if (expdiff > 15)
+ cmp $15,%r10
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4,%xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2,%xmm5 # npi2 * piby2_2tail
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4,%xmm1 # t - rhead
+ subsd %xmm3,%xmm1 # -rtail
+ subsd %xmm1,%xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4,%xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5,%xmm1
+ subsd %xmm5,%xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+# region = npi2 & 3;
+
+ subsd %xmm0,%xmm4 # rhead-r
+ subsd %xmm1,%xmm4 # rr = (rhead-r) - rtail
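+
+# A compact C sketch of the Cody-Waite style reduction performed above
+# (assembled from the comments in this block; r and rr feed the polynomials,
+# region selects the quadrant):
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_1tail;
+#   if (xexp - exponent(rhead - rtail) > 15) {
+#       /* x is very close to a multiple of pi/2: refine with piby2_2/piby2_2tail */
+#       t     = rhead;
+#       rtail = npi2 * piby2_2;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   }
+#   r      = rhead - rtail;
+#   rr     = (rhead - r) - rtail;
+#   region = npi2 & 3;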
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then cos(r) is ~1.0 and sin(r) is ~r, to within 53 bits, when
+# r is < 2^-27. We already have r at this point, so we can skip the sin/cos
+# polynomials.
+
+ cmp $0x03f2,%r9 # if r small.
+ jge .Lcossin_piby4 # use taylor series if not
+ cmp $0x03de,%r9 # if r really small.
+ jle .Lr_small # then sin(r) = r, cos(r) = 1
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm2,%xmm2 # x^2
+
+## if region is 0 or 2 do a sin calc.
+ and $1,%ecx
+ jnz .Lregion13
+
+# region 0 or 2 do a sincos calculation
+# use a simple polynomial
+# sin=x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666
+ mulsd %xmm0,%xmm3 # * x
+ mulsd %xmm2,%xmm3 # * x^2
+ subsd %xmm3,%xmm0 # xs
+# cos=1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm1 # xc
+
+ jmp .Ladjust_region
+
+.align 16
+.Lregion13:
+# region 1 or 3 do a cossin calculation
+# use a simple polynomial
+# sin=x - x*x*x*0.166666666666666666;
+ movsd %xmm0,%xmm1
+
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666
+ mulsd %xmm0,%xmm3 # 0.166666666* x
+ mulsd %xmm2,%xmm3 # 0.166666666* x * x^2
+ subsd %xmm3,%xmm1 # xs
+# cos=1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm0 # xc
+
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 0 or 2 do a sincos calc.
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1
+ and $1,%ecx
+ jz .Ladjust_region
+
+## if region is 1 or 3 do a cossin calc.
+ movsd %xmm0,%xmm1 # sin(r) is r
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
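+
+# The shortcut above as C (a sketch of the branch structure only; the
+# 0x3f2 / 0x3de compares test the biased exponent of r):
+#
+#   if (|r| >= 2^-13)        { /* use the full polynomials (.Lcossin_piby4) */   }
+#   else if (|r| very small) { sin(r) = r;               cos(r) = 1.0;           }
+#   else                     { sin(r) = r - r*r*r/6.0;   cos(r) = 1.0 - 0.5*r*r; }
+#   /* for regions 1 and 3 the two results are swapped before the sign fix-up */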
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2(x, &r, &rr, &region);
+
+ mov %rdi, p_temp1(%rsp)
+ mov %rsi, p_temp1+8(%rsp)
+ mov %r11,p_temp(%rsp)
+
+ lea region(%rsp),%rdx
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp1(%rsp), %rdi
+ mov p_temp1+8(%rsp), %rsi
+ mov p_temp(%rsp),%r11
+
+ movsd r(%rsp),%xmm0 # x
+ movsd rr(%rsp),%xmm4 # xx
+ mov region(%rsp),%eax # region to classify for sin/cos calc
+ mov %eax,%ecx # region to get sign
+
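+# The helper called above does the slow-path reduction; conceptually
+# (a sketch of its contract as used here, not the helper's own code):
+#
+#   __amd_remainder_piby2(x, &r, &rr, &region);
+#   /* on return r (+ rr as a tail) is x reduced into [-pi/4,pi/4] and
+#      region is the quadrant count, so the fast-path polynomials apply */
+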
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+.align 16
+.Lcossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# perform taylor series to calc sinx, cosx
+# x2 = r * r;
+#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path
+#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 #x2
+
+## if region is 0 or 2 do a sincos calc.
+ and $1,%ecx
+ jz .Lsincos02
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cossin calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+
+
+ movlhps %xmm2,%xmm2
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movsd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 cos
+ # xmm2 contains x2 for x3 sin
+
+ addpd %xmm5,%xmm3 # zc in lower and zs in upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for the sin term
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = sin, xmm3 = cos
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+
+ mulsd %xmm2,%xmm5 # sin *x3
+ mulsd %xmm1,%xmm3 # cos *x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ movsd %xmm0,%xmm1
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0
+ subsd %xmm2,%xmm0 # 1 - t
+
+ mulsd p_temp(%rsp),%xmm1 # x*xx
+ subsd %xmm4,%xmm0 # (1-t) -r
+ subsd %xmm1,%xmm0 # ((1-t) -r) - x *xx
+
+ mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx
+
+ addsd %xmm3,%xmm0 # (((1-t) -r) - x *xx) + cos
+
+ subsd %xmm4,%xmm5 # sin - 0.5*x2*xx
+
+ addsd %xmm2,%xmm0 # xmm0 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term
+
+ addsd p_temp(%rsp),%xmm5 # sin + xx
+ movsd p_temp1(%rsp),%xmm1 # load x
+ addsd %xmm5,%xmm1 # xmm1= sin+x, final sin term
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos02:
+# region 0 or 2 do a sincos calculation
+ movlhps %xmm2,%xmm2
+
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movsd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos
+ # xmm2 contains x2 for x3 for sin
+
+ addpd %xmm5,%xmm3 # zs in lower and zc in upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for sin
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+
+ mulsd %xmm2,%xmm3 # sin *x3
+ mulsd %xmm1,%xmm5 # cos *x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ subsd %xmm2,%xmm1 # 1 - t
+
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd p_temp(%rsp),%xmm0 # x*xx
+
+ subsd %xmm4,%xmm1 # (1-t) -r
+ subsd %xmm0,%xmm1 # ((1-t) -r) - x *xx
+
+ mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx
+
+ addsd %xmm5,%xmm1 # (((1-t) -r) - x *xx) + cos
+
+ subsd %xmm4,%xmm3 # sin - 0.5*x2*xx
+
+ addsd %xmm2,%xmm1 # xmm1 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term
+
+ addsd p_temp(%rsp),%xmm3 # sin + xx
+ movsd p_temp1(%rsp),%xmm0 # load x
+ addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# switch (region)
+.align 16
+.Ladjust_region: # positive or negative for sin return val in xmm0
+
+ mov %eax,%r9d
+
+ shr $1,%eax
+ mov %eax,%ecx
+ and %r11d,%eax
+
+ not %ecx
+ not %r11d
+ and %r11d,%ecx
+
+ or %ecx,%eax
+ and $1,%eax
+ jnz .Lcos_sign
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0,%xmm2
+ xorpd %xmm0,%xmm0
+ subsd %xmm2,%xmm0
+
+.Lcos_sign: # positive or negative for cos return val in xmm1
+ add $1,%r9
+ and $2,%r9d
+ jz .Lsincos_cleanup
+## if the original region 1 or 2 then we negate the result.
+ movsd %xmm1,%xmm2
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm1
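+
+# The sign fix-up above in C (a sketch; region is the quadrant from the
+# reduction and xneg (r11) is 1 when the original argument was negative):
+#
+#   if ((((region >> 1) ^ xneg) & 1) != 0)   /* regions 0,1 with x < 0, or  */
+#       sin = -sin;                          /* regions 2,3 with x >= 0     */
+#   if (((region + 1) & 2) != 0)             /* regions 1 and 2             */
+#       cos = -cos;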
+
+#.align 16
+.Lsincos_cleanup:
+ movsd %xmm0, (%rdi) # save the sin
+ movsd %xmm1, (%rsi) # save the cos
+
+ add $stack_size,%rsp
+ ret
+
+.align 16
+.Lsincos_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
diff --git a/src/gas/sincosf.S b/src/gas/sincosf.S
new file mode 100644
index 0000000..dcdbe9a
--- /dev/null
+++ b/src/gas/sincosf.S
@@ -0,0 +1,402 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincosf function.
+#
+# Prototype:
+#
+# void sincosf(float x, float * sinfx, float * cosfx);
+#
+# Computes sinf(x) and cosf(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 (label keeps the name of the original 0x0411E848000000000 = 5e5 threshold)
+ .quad 0
+
+.align 32
+.Lcsarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincosf)
+#define fname_special _sincosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ p_temp2, 0x50 # temporary for get/put bits operation
+.equ p_temp3, 0x60 # temporary for get/put bits operation
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ r, 0x80 # pointer to r for amd_remainder_piby2
+.equ stack_size, 0xa8
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+
+ xorpd %xmm2,%xmm2
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
+ cvtss2sd %xmm0,%xmm0
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp),%rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000,%rax
+ mov %rax,%r10
+ and %rdx,%r10
+ cmp %rax,%r10
+ jz .L__sc_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff,%r10
+ and %rdx,%r10 # r10 is ax
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18,%rax
+ cmp %rax,%r10
+ jg .L__sc_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x3f20000000000000, %rax
+ cmp %rax, %r10
+ jge .L__sc_notsmallest
+
+# sinf = x, cosf = 1.0
+ movsd .L__real_3ff0000000000000(%rip),%xmm1
+ jmp .L__sc_cleanup
+
+# *s = sin_piby4(x, 0.0);
+# *c = cos_piby4(x, 0.0);
+.L__sc_notsmallest:
+ xor %eax,%eax # region 0
+ mov %r10,%rdx
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ jmp .L__sc_piby4
+
+.L__sc_reduce:
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+
+# xneg = (ax != ux);
+ cmp %r10,%rdx
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0,%xmm2
+ movsd %xmm2,%xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip),%r10
+ jae .Lsincosf_reduce_precise
+
+ movsd %xmm0,%xmm2
+ movsd %xmm0,%xmm4
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # twobypi
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10,%r9
+ shr $52,%r9 # >> EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5,%xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttpd2dq %xmm2,%xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip),%xmm1 # piby2_1tail
+ cvtdq2pd %xmm0,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+
+ mulsd %xmm2,%xmm3 # use piby2_1
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2,%xmm1 # rtail
+
+ movd %xmm0,%eax
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm3 # piby2_2
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm5 # piby2_2tail
+ movd %xmm0,%rcx
+
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1,%rcx # strip any sign bit
+ shr $53,%rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx,%r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15,%r9
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4,%xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2,%xmm5 # npi2 * piby2_2tail
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4,%xmm1 # t - rhead
+ subsd %xmm3,%xmm1 # -rtail
+ subsd %xmm1,%xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4,%xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5,%xmm1
+ subsd %xmm5,%xmm0
+
+# region = npi2 & 3;
+# and $3,%eax
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+
+## if the input was close to a pi/2 multiple
+#
+
+ cmp $0x03f2,%rcx # if r small.
+ jge .L__sc_piby4 # use taylor series if not
+ cmp $0x03de,%rcx # if r really small.
+ jle .Lsinsmall # then sin(r) = r
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm2,%xmm2 # x^2
+# use a simple polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 #
+ mulsd %xmm0,%xmm3 # * x
+ mulsd %xmm2,%xmm3 # * x^2
+ subsd %xmm3,%xmm0 # xs
+
+# *c = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm1
+ jmp .L__adjust_region
+
+.Lsinsmall: # then sin(r) = r
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1
+ jmp .L__adjust_region
+
+# perform taylor series to calc sinx, cosx
+# COS
+# x2 = x * x;
+# return (1.0 - 0.5 * x2 + (x2 * x2 *
+# (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))));
+# SIN
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# x2 = x * x;
+# return (x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+# done with reducing the argument. Now perform the sin/cos calculations.
+.align 16
+.L__sc_piby4:
+# x2 = r * r;
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 # x2
+ shufpd $0,%xmm2,%xmm2 # x2,x2
+ movsd %xmm2,%xmm4
+ mulsd %xmm4,%xmm4 # x4
+ shufpd $0,%xmm4,%xmm4 # x4,x4
+
+# x2m = _mm_set1_pd (x2);
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# xc = t + ( x2 * x2 * (cc1 + x2 * zc));
+ movapd .Lcsarray+0x30(%rip),%xmm1 # c4
+ movapd .Lcsarray+0x10(%rip),%xmm3 # c2
+ mulpd %xmm2,%xmm1 # x2c4
+ mulpd %xmm2,%xmm3 # x2c2
+
+# rc = 0.5 * x2;
+ mulsd %xmm2,%xmm5 #rc
+ mulsd %xmm0,%xmm2 #x3
+
+ addpd .Lcsarray+0x20(%rip),%xmm1 # c3 + x2c4
+ addpd .Lcsarray(%rip),%xmm3 # c1 + x2c2
+ mulpd %xmm4,%xmm1 # x4(c3 + x2c4)
+ addpd %xmm3,%xmm1 # c1 + x2c2 + x4(c3 + x2c4)
+
+# -t = rc-1;
+ subsd .L__real_3ff0000000000000(%rip),%xmm5 # 1.0
+# now we have the poly for sin in the low half, and cos in upper half
+ mulsd %xmm1,%xmm2 # x3(sin poly)
+ shufpd $3,%xmm1,%xmm1 # get cos poly to low half of register
+ mulsd %xmm4,%xmm1 # x4(cos poly)
+
+ addsd %xmm2,%xmm0 # sin = r+...
+ subsd %xmm5,%xmm1 # cos = poly-(-t)
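+
+# The packed evaluation above, written out as scalar C (a sketch of what
+# the SSE2 code computes; s1..s4 / c1..c4 are the .Lcsarray coefficients):
+#
+#   x2  = r * r;                          x4 = x2 * x2;
+#   zs  = s1 + x2*s2 + x4*(s3 + x2*s4);   /* low halves of the packed regs  */
+#   zc  = c1 + x2*c2 + x4*(c3 + x2*c4);   /* high halves of the packed regs */
+#   sin = r + (r * x2) * zs;
+#   cos = (1.0 - 0.5 * x2) + x4 * zc;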
+
+.L__adjust_region: # xmm0 is sin, xmm1 is cos
+# switch (region)
+ mov %eax,%ecx
+ and $1,%eax
+ jz .Lregion02
+# region 1 or 3
+ movsd %xmm0,%xmm2 # swap sin,cos
+ movsd %xmm1,%xmm0 # sin = cos
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm1 # cos = -sin
+
+.Lregion02:
+ and $2,%ecx
+ jz .Lregion23
+# region 2 or 3
+ movsd %xmm0,%xmm2
+ movsd %xmm1,%xmm3
+ xorpd %xmm0,%xmm0
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm0 # sin = -sin
+ subsd %xmm3,%xmm1 # cos = -cos
+
+.Lregion23:
+## if (xneg) *s = -*s ;
+ cmp %r10,%rdx
+ jz .L__sc_cleanup
+ movsd %xmm0,%xmm2
+ xorpd %xmm0,%xmm0
+ subsd %xmm2,%xmm0 # sin = -sin
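+
+# Equivalently, in C (a sketch of the quadrant mapping used above; s and c
+# are sin/cos of the reduced argument, xneg means the input was negative):
+#
+#   if (region & 1) { t = s; s = c; c = -t; }   /* quadrants 1,3: swap, negate cos */
+#   if (region & 2) { s = -s; c = -c; }         /* quadrants 2,3                   */
+#   if (xneg)       { s = -s; }                 /* sin is odd, cos is even         */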
+
+.align 16
+.L__sc_cleanup:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ cvtsd2ss %xmm1,%xmm1
+
+ movss %xmm0,(%rdi) # save the sin
+ movss %xmm1,(%rsi) # save the cos
+
+ add $stack_size,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincosf_reduce_precise:
+# /* Reduce abs(x) into range [-pi/4,pi/4] */
+# __amd_remainder_piby2(ax, &r, &region);
+
+ mov %rdx,p_temp(%rsp) # save ux for use later
+ mov %r10,p_temp1(%rsp) # save ax for use later
+ mov %rdi,p_temp2(%rsp) # save rdi (sin result pointer) for use later
+ mov %rsi,p_temp3(%rsp) # save rsi (cos result pointer) for use later
+ movd %xmm0,%rdi
+ lea r(%rsp),%rsi
+ lea region(%rsp),%rdx
+ sub $0x040,%rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x040,%rsp
+ mov p_temp(%rsp),%rdx # restore ux for use later
+ mov p_temp1(%rsp),%r10 # restore ax for use later
+ mov p_temp2(%rsp),%rdi # restore rdi (sin result pointer)
+ mov p_temp3(%rsp),%rsi # restore rsi (cos result pointer)
+
+ mov $1,%r8d # for determining region later on
+ movsd r(%rsp),%xmm0 # r
+ mov region(%rsp),%eax # region
+ jmp .L__sc_piby4
+
+.align 16
+.L__sc_naninf:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ call fname_special # rdi and rsi are ready for the function call
+ add $stack_size, %rsp
+ ret
diff --git a/src/gas/sinf.S b/src/gas/sinf.S
new file mode 100644
index 0000000..c2083ff
--- /dev/null
+++ b/src/gas/sinf.S
@@ -0,0 +1,436 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sinf function.
+#
+# Prototype:
+#
+# float sinf(float x);
+#
+# Computes sinf(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 (label keeps the name of the original 0x0411E848000000000 = 5e5 threshold)
+ .quad 0
+
+.align 32
+.Lcosfarray:
+ .quad 0x0bfe0000000000000 # -0.5 c0
+ .quad 0
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0x0bf56c16c16c16c16 # -0.00138889 c2
+ .quad 0
+ .quad 0x03EFA01A01A01A019 # 2.48016e-005 c3
+ .quad 0
+ .quad 0x0be927e4fb7789f5c # -2.75573e-007 c4
+ .quad 0
+
+.align 32
+.Lsinfarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x03f81111111111111 # 0.00833333 s2
+ .quad 0
+ .quad 0x0bf2a01a01a01a01a # -0.000198413 s3
+ .quad 0
+ .quad 0x03ec71de3a556c734 # 2.75573e-006 s4
+ .quad 0
+
+.text
+.align 32
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sinf)
+#define fname_special _sinf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ region, 0x60 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lsinf_naninf
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ cvtss2sd %xmm0, %xmm0 # convert input to double.
+ movsd %xmm0,p_temp(%rsp) # get the input value to an integer register.
+
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lsinf_reduce
+
+## if (ax < 0x3f80000000000000) /* abs(x) < 2.0^(-7) */
+ mov $0x3f80000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsinf_small
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x3f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsinf_smaller
+
+# sinf = x;
+ jmp .Lsinf_cleanup # done
+
+## else
+
+.Lsinf_smaller:
+# sinf = x - x^3 * 0.1666666666666666666;
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fc5555555555555(%rip), %xmm4 # 0.1666666666666666666
+ mulsd %xmm2, %xmm2 # x^2
+ mulsd %xmm0, %xmm2 # x^3
+ mulsd %xmm4, %xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2, %xmm0 # x - x^3 * 0.1666666666666666666
+ jmp .Lsinf_cleanup
+
+.Lsinf_small:
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sinf calculation
+# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4));
+ movsd .Lsinfarray+0x30(%rip), %xmm1 # s4
+ mulsd %xmm2, %xmm1 # s4x2
+ movsd %xmm2, %xmm4 # move for x4
+ movsd .Lsinfarray+0x10(%rip), %xmm5 # s2
+ mulsd %xmm2, %xmm4 # x4
+ movsd %xmm0, %xmm3 # move for x3
+ mulsd %xmm2, %xmm5 # s2x2
+ mulsd %xmm2, %xmm3 # x3
+ addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2
+ mulsd %xmm4, %xmm1 # s3x4+s4x6
+ addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2
+ addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6
+ mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6)
+ addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6)
+ jmp .Lsinf_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+.Lsinf_reduce:
+
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+ mov $0, %r11d
+
+## if (xneg) x = -x;
+ jz .L50e5
+ mov $1, %r11d
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.L50e5:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lsinf_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 #>>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx #>> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 #expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 #rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+# region = npi2 & 3;
+# No need rr for float case
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then cosf(r) is ~1.0 and sinf(r) is ~r to single precision
+# when r is < 2^-13. We already have r at this point, so we can skip the
+# sinf/cosf polynomials.
+
+ cmp $0x03f2, %rcx ## if r small.
+ jge .Lsinf_piby4 # use taylor series if not
+ cmp $0x03de, %rcx ## if r really small.
+ jle .Lr_small # then sinf(r) = r
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 #x^2
+
+## if region is 0 or 2 do a sinf calc.
+ and %eax, %r8d
+ jnz .Lcosfregion
+
+# region 0 or 2 do a sinf calculation
+# use a simple polynomial
+# x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3 #
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 32
+.Lcosfregion:
+# region 1 or 3 do a cosf calculation
+# use a simple polynomial
+# 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0 # xc
+ jmp .Ladjust_region
+
+.align 32
+.Lr_small:
+## if region is 1 or 3 do a cosf calc.
+ and %eax, %r8d
+ jz .Ladjust_region
+
+# odd
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cosf(r) is a 1
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsinf_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2d2f(x, &r, ®ion);
+
+ mov %r11,p_temp(%rsp)
+ lea region(%rsp), %rdx
+ lea r(%rsp), %rsi
+ movd %xmm0, %rdi
+ sub $0x20, %rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x20, %rsp
+ mov p_temp(%rsp), %r11
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm1 # r
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# perform taylor series to calc sinfx, cosfx
+.Lsinf_piby4:
+# x2 = r * r;
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 #x2
+
+## if region is 0 or 2 do a sinf calc.
+ and %eax, %r8d
+ jnz .Lcosfregion2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 do a sinf calculation
+# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4));
+ movsd .Lsinfarray+0x30(%rip), %xmm1 # s4
+ mulsd %xmm2, %xmm1 # s4x2
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lsinfarray+0x10(%rip), %xmm5 # s2
+ mulsd %xmm2, %xmm5 # s2x2
+ movsd %xmm0, %xmm3 # move for x3
+ mulsd %xmm2, %xmm3 # x3
+ addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2
+ mulsd %xmm4, %xmm1 # s3x4+s4x6
+ addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2
+ addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6
+ mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6)
+ addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6)
+
+ jmp .Ladjust_region
+
+.align 32
+.Lcosfregion2:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cosf calculation
+# zc = 1 - 0.5*x2 + c1*x4 + c2*x6 + c3*x8 + c4*x10, for higher precision
+ movsd .Lcosfarray+0x40(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm1 # c4x2
+ movsd .Lcosfarray+0x20(%rip), %xmm3 # c2
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lcosfarray(%rip), %xmm0 # c0
+ mulsd %xmm2, %xmm3 # c2x2
+ mulsd %xmm2, %xmm0 # c0x2 (=-0.5x2)
+ addsd .Lcosfarray+0x30(%rip), %xmm1 # c3+c4x2
+ mulsd %xmm4, %xmm1 # c3x4 + c4x6
+ addsd .Lcosfarray+0x10(%rip), %xmm3 # c1+c2x2
+ addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6
+ mulsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10
+ addsd .L__real_3ff0000000000000(%rip), %xmm0 # 1 - 0.5x2
+ addsd %xmm1, %xmm0 # 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region: # positive or negative
+# switch (region)
+ shr $1, %eax
+ mov %eax, %ecx
+ and %r11d, %eax
+
+ not %ecx
+ not %r11d
+ and %r11d, %ecx
+
+ or %ecx, %eax
+ and $1, %eax
+ jnz .Lsinf_cleanup
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lsinf_cleanup:
+ cvtsd2ss %xmm0, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lsinf_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
diff --git a/src/gas/trunc.S b/src/gas/trunc.S
new file mode 100644
index 0000000..c29d0fd
--- /dev/null
+++ b/src/gas/trunc.S
@@ -0,0 +1,87 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# trunc.S
+#
+# An implementation of the trunc libm function.
+#
+# The trunc functions round their argument to the integer value, in floating format,
+# nearest to but no larger in magnitude than the argument.
+#
+#
+# Prototype:
+#
+# double trunc(double x);
+#
+
+#
+# Algorithm:
+#
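+# A C sketch of the approach used below (illustrative only; reading the
+# .Error_val path as NaN/overflow handling is an interpretation):
+#
+#   long long i = (long long)x;              /* CVTTSD2SIQ truncates toward zero  */
+#   if (i == 0x8000000000000000LL) {         /* conversion overflowed             */
+#       if (x != x) return x + x;            /* NaN in -> quiet NaN out           */
+#       return x;                            /* |x| too large: already integral   */
+#   }
+#   double r = (double)i;                    /* CVTSI2SDQ                         */
+#   return copysign(r, x);                   /* keep the sign, so trunc(-0.5) == -0.0 */
+#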
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(trunc)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm1
+
+#convert double to integer.
+ CVTTSD2SIQ %xmm0,%rax
+ CMP .L__Erro_mask(%rip),%rax
+ jz .Error_val
+#convert integer to double
+ CVTSI2SDQ %rax,%xmm0
+
+ PSRLQ $63,%xmm1
+ PSLLQ $63,%xmm1
+
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.Error_val:
+ MOVAPD %xmm1,%xmm2
+ CMPEQSD %xmm1,%xmm1
+ ADDSD %xmm2,%xmm2
+
+ PAND %xmm1,%xmm0
+ PANDN %xmm2,%xmm1
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.data
+.align 16
+.L__Erro_mask: .quad 0x8000000000000000
+ .quad 0x0
diff --git a/src/gas/truncf.S b/src/gas/truncf.S
new file mode 100644
index 0000000..c73ad8f
--- /dev/null
+++ b/src/gas/truncf.S
@@ -0,0 +1,93 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# truncf.S
+#
+# An implementation of the truncf libm function.
+#
+#
+# The truncf functions round their argument to the integer value, in floating format,
+# nearest to but no larger in magnitude than the argument.
+#
+#
+# Prototype:
+#
+# float truncf(float x);
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(truncf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+
+ MOVAPD %xmm0,%xmm1
+
+# convert float to integer.
+ CVTTSS2SIQ %xmm0,%rax
+
+ CMP .L__Erro_mask(%rip),%rax
+ jz .Error_val
+
+# convert integer to float
+ CVTSI2SSQ %rax,%xmm0
+
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.Error_val:
+ MOVAPD %xmm1,%xmm2
+ CMPEQSS %xmm1,%xmm1
+ ADDSS %xmm2,%xmm2
+
+ PAND %xmm1,%xmm0
+ PANDN %xmm2,%xmm1
+ POR %xmm1,%xmm0
+
+
+
+
+ ret
+
+.data
+.align 16
+.L__Erro_mask: .quad 0x8000000000000000
+ .quad 0x0
diff --git a/src/gas/v4hcosl.S b/src/gas/v4hcosl.S
new file mode 100644
index 0000000..a3ded17
--- /dev/null
+++ b/src/gas/v4hcosl.S
@@ -0,0 +1,62 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hcosl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4cos(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 cos values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
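+#
+# Example caller (a sketch only; assumes SSE2 intrinsics and the 16-byte
+# aligned output buffer required above):
+#
+#   #include <emmintrin.h>
+#   void v4cos(__m128d x1, __m128d x2, double *ya);
+#
+#   double in[4] = {0.1, 0.2, 0.3, 0.4};
+#   double out[4] __attribute__((aligned(16)));
+#   v4cos(_mm_loadu_pd(&in[0]), _mm_loadu_pd(&in[2]), out);
+#   /* out[i] now holds cos(in[i]) for i = 0..3 */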
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_cos
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4cos
+ .type v4cos,@function
+v4cos:
+ push %rdi
+ call __vrd4_cos@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
diff --git a/src/gas/v4helpl.S b/src/gas/v4helpl.S
new file mode 100644
index 0000000..02fa080
--- /dev/null
+++ b/src/gas/v4helpl.S
@@ -0,0 +1,83 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4help.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4exp(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 exp values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# %xmm0 - __m128d x1
+# %xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_exp
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4exp
+ .type v4exp,@function
+v4exp:
+ push %rdi
+ call __vrd4_exp@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# %xmm0,%rcx - __m128 x1
+# %xmm1,%rdx - __m128 x2
+# r8 - float *ya
+
+.extern __vrs8_expf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8expf
+ .type v8expf,@function
+v8expf:
+ push %rdi
+ call __vrs8_expf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
diff --git a/src/gas/v4hfrcpal.S b/src/gas/v4hfrcpal.S
new file mode 100644
index 0000000..d648d9d
--- /dev/null
+++ b/src/gas/v4hfrcpal.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hfrcpal.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4frcpa(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 frcpa values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_frcpa
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4frcpa
+ .type v4frcpa,@function
+v4frcpa:
+ push %rdi
+ call __vrd4_frcpa@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
diff --git a/src/gas/v4hlog10l.S b/src/gas/v4hlog10l.S
new file mode 100644
index 0000000..0cdb6ba
--- /dev/null
+++ b/src/gas/v4hlog10l.S
@@ -0,0 +1,81 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlog10l.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log10(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log10 values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log10
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log10
+ .type v4log10,@function
+v4log10:
+ push %rdi
+ call __vrd4_log10@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - single *ya
+
+.extern __vrs8_log10f
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8log10f
+ .type v8log10f,@function
+v8log10f:
+ push %rdi
+ call __vrs8_log10f@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
diff --git a/src/gas/v4hlog2l.S b/src/gas/v4hlog2l.S
new file mode 100644
index 0000000..1a8c33e
--- /dev/null
+++ b/src/gas/v4hlog2l.S
@@ -0,0 +1,81 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlog2l.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log2(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log2 values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log2
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log2
+ .type v4log2,@function
+v4log2:
+ push %rdi
+ call __vrd4_log2@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - single *ya
+
+.extern __vrs8_log2f
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8log2f
+ .type v8log2f,@function
+v8log2f:
+ push %rdi
+ call __vrs8_log2f@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
diff --git a/src/gas/v4hlogl.S b/src/gas/v4hlogl.S
new file mode 100644
index 0000000..512648d
--- /dev/null
+++ b/src/gas/v4hlogl.S
@@ -0,0 +1,84 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlogl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log
+ .type v4log,@function
+v4log:
+ push %rdi
+ call __vrd4_log@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - float *ya
+
+#.extern __vrs8_logf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8logf
+ .type v8logf,@function
+v8logf:
+ push %rdi
+ call __vrs8_logf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
+
diff --git a/src/gas/v4hsinl.S b/src/gas/v4hsinl.S
new file mode 100644
index 0000000..97bfa2d
--- /dev/null
+++ b/src/gas/v4hsinl.S
@@ -0,0 +1,62 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hsinl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4sin(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 sin values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_sin
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4sin
+ .type v4sin,@function
+v4sin:
+ push %rdi
+ call __vrd4_sin@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
diff --git a/src/gas/vrd2cos.S b/src/gas/vrd2cos.S
new file mode 100644
index 0000000..d12a156
--- /dev/null
+++ b/src/gas/vrd2cos.S
@@ -0,0 +1,756 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm cos function.
+#
+# Prototype:
+#
+# __m128d __vrd2_cos(__m128d x);
+#
+# Computes Cosine of x
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.text
+.align 16
+.p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ p_temp2,0x20 # temporary for get/put bits operation
+.equ p_xmm6, 0x30 # temporary for get/put bits operation
+.equ p_xmm7, 0x40 # temporary for get/put bits operation
+.equ p_xmm8, 0x50 # temporary for get/put bits operation
+.equ p_xmm9, 0x60 # temporary for get/put bits operation
+.equ p_xmm10,0x70 # temporary for get/put bits operation
+.equ p_xmm11,0x80 # temporary for get/put bits operation
+.equ p_xmm12,0x90 # temporary for get/put bits operation
+.equ p_xmm13,0x0A0 # temporary for get/put bits operation
+.equ p_xmm14,0x0B0 # temporary for get/put bits operation
+.equ p_xmm15,0x0C0 # temporary for get/put bits operation
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ rr, 0x0E0 # pointer to rr for remainder_piby2
+.equ region, 0x0F0 # pointer to region for remainder_piby2
+.equ p_original,0x100 # original x
+.equ p_mask, 0x110 # mask
+.equ p_sign, 0x120 # sign mask
+
+.globl __vrd2_cos
+ .type __vrd2_cos,@function
+__vrd2_cos:
+ sub $0x138,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+movdqa %xmm0, p_original(%rsp)
+andpd .L__real_7fffffffffffffff(%rip),%xmm0
+movdqa %xmm0, p_temp(%rsp)
+mov $0x3FE921FB54442D18,%rdx #piby4
+mov $0x411E848000000000,%r10 #5e5
+movapd .L__real_v2p__27(%rip),%xmm4 #for later use
+
+movapd %xmm0,%xmm2 #x
+movapd %xmm0,%xmm4 #x
+
+mov p_temp(%rsp),%rax #rax = lower arg
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movapd .L__real_3fe0000000000000(%rip),%xmm5 #0.5 for later use
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ movapd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1
+ addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2
+ movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail
+ cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints
+ cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double.
+
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm3 # npi2 * piby2_1
+ subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1
+
+#t = rhead;
+ movapd %xmm4,%xmm5 # xmm5=t=rhead
+
+#rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2
+
+#rhead = t - rtail;
+ subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subpd %xmm4,%xmm5 # t-rhead
+ subpd %xmm5,%xmm1 # rtail-(t - rhead)
+ addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead))
+
+#r = rhead - rtail
+#rr=(rhead-r) -rtail
+#Sign
+#Region
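+#
+# For cos(x) with x ~ npi2*pi/2 + r, the quadrant npi2 mod 4 decides the result:
+#   0 -> cos(r), 1 -> -sin(r), 2 -> -cos(r), 3 -> sin(r)
+# so "region" below is npi2 & 1 (selects the sin or cos polynomial) and the
+# sign mask is built from bit 1 of (npi2 + 1).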
+ movdqa %xmm0,%xmm5 # Sign
+ movdqa %xmm0,%xmm6 # Region
+ movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype)
+
+ paddd .L__reald_one_one(%rip),%xmm6 # Sign
+ pand .L__reald_two_two(%rip),%xmm6
+ punpckldq %xmm6,%xmm6
+ psllq $62,%xmm6 # xmm6 is in Int format
+
+ subpd %xmm1,%xmm0 # rhead - rtail
+ pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Sin/Cos
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for sincos
+ subpd %xmm0,%xmm4 # rr=rhead-r
+ movd %xmm5,%r8 # Region
+ movapd %xmm0,%xmm2 # Move for x2
+ movdqa %xmm6,%xmm6 # handle xmm6 retype
+ mulpd %xmm0,%xmm2 # x2
+ subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+
+.align 16
+.L__vrd2_cos_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_cos_piby4
+
+.Lvrd2_cos_piby4:
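+# Both lanes are in an even quadrant, so both need +/-cos(r).  A sketch of the
+# approximation the scheduled code below computes, for |r| <= pi/4:
+#   cos(r) ~= 1 - 0.5*r^2 + r^4*(c1 + c2*r^2 + ... + c6*r^10) - r*rr
+# with the 1 - 0.5*r^2 part split so its rounding error is re-added at the end.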
+ mulpd %xmm0,%xmm4 # x*xx
+ movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype)
+ movapd .Lcosarray+0x50(%rip),%xmm1 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm0 # c3
+ mulpd %xmm2,%xmm5 # r = 0.5 *x2
+ movapd %xmm2,%xmm3 # copy of x2 for x4
+ movapd %xmm4,p_temp(%rsp) # store x*xx
+ mulpd %xmm2,%xmm1 # c6*x2
+ mulpd %xmm2,%xmm0 # c3*x2
+ subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0
+ mulpd %xmm2,%xmm3 # x4
+ addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3
+ addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t)
+ mulpd %xmm2,%xmm3 # x6
+ mulpd %xmm2,%xmm1 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm0 # x2(c2+x2C3)
+ movapd %xmm2,%xmm4 # copy of x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2
+ addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3)
+ mulpd %xmm2,%xmm2 # x4
+ subpd %xmm4,%xmm5 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6))
+ addpd %xmm1,%xmm0 # zc
+ subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0
+ subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx
+ mulpd %xmm2,%xmm0 # x4 * zc
+ addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subpd %xmm4,%xmm0 # result - (-t)
+ xorpd %xmm6,%xmm0 # xor with sign
+ jmp .L__vrd2_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_not_cos_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_not_cos_sin_piby4
+
+.Lvrd2_cos_sin_piby4:
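+# Mixed case: one lane is in an odd quadrant (+/-sin(r)) and the other in an
+# even quadrant (+/-cos(r)).  .Lsincosarray interleaves the sin and cos
+# coefficient sets so one packed polynomial evaluation serves both lanes;
+# the lanes are then finished separately with scalar ops below.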
+
+ movdqa %xmm6,p_temp1(%rsp) # Store Sign
+ movapd %xmm4,p_temp(%rsp) # Store rr
+
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype)
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move x2 for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6)
+
+ movhlps %xmm1,%xmm1 # move high x4 for cos
+ mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6))
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ movapd %xmm2,%xmm4 # move low x2 for x3
+ mulsd %xmm0,%xmm4 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+
+ addpd %xmm3,%xmm5 # z
+ movhlps %xmm2,%xmm6 # move high r for cos
+ movhlps %xmm5,%xmm3 # xmm5 = sin
+ # xmm3 = cos
+
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx
+
+ mulsd %xmm4,%xmm5 # sin *x3
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0
+ mulsd %xmm1,%xmm3 # cos *x4
+ subsd %xmm6,%xmm4 # t=1.0-r
+
+ movhlps %xmm0,%xmm1
+ subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx
+
+ mulsd p_temp+8(%rsp),%xmm1 # x * xx
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1
+ subsd %xmm4,%xmm2 # 1 - t
+ addsd p_temp(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm5,%xmm0 # sin + x
+ addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx)
+ addsd %xmm4,%xmm3 # cos+t
+
+ movapd p_temp1(%rsp),%xmm5 # load sign
+ movlhps %xmm3,%xmm0
+ xorpd %xmm5,%xmm0
+ jmp .L__vrd2_cos_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_not_cos_sin_piby4:
+ cmp %r9,%r8
+ jnz .Lvrd2_sin_piby4
+
+.Lvrd2_sin_cos_piby4:
+
+	movapd	 %xmm4,p_temp(%rsp)		# move rr to memory
+	movapd	 %xmm0,p_temp1(%rsp)		# move r to memory
+ movapd %xmm6,p_sign(%rsp)
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ movhlps %xmm0,%xmm0 # high of x for x3
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+
+ movhlps %xmm2,%xmm4 # high of x2 for x3
+
+ addpd %xmm5,%xmm3 # z
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+ mulsd %xmm0,%xmm4 # x3 #
+ movhlps %xmm3,%xmm5 # xmm5 = sin
+ # xmm3 = cos
+
+ mulsd %xmm4,%xmm5 # sin*x3 #
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 #
+ mulsd %xmm1,%xmm3 # cos*x4 #
+
+ subsd %xmm2,%xmm4 # t=1.0-r #
+
+ movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx #
+ mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx #
+ subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx #
+ addsd p_temp+8(%rsp),%xmm5 # sin+xx #
+
+ movlpd p_temp1(%rsp),%xmm6 # x
+ mulsd p_temp(%rsp),%xmm6 # x *xx #
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 #
+ subsd %xmm4,%xmm1 # 1 -t #
+ addsd %xmm5,%xmm0 # sin+x #
+ subsd %xmm2,%xmm1 # (1-t) - r #
+ subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx #
+ addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx #
+ addsd %xmm4,%xmm3 # cos+t #
+
+ movapd p_sign(%rsp),%xmm2 # load sign
+ movlhps %xmm0,%xmm3
+ movapd %xmm3,%xmm0
+ xorpd %xmm2,%xmm0
+ jmp .L__vrd2_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_sin_piby4:
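+# Both lanes are in an odd quadrant, so both need +/-sin(r).  A sketch of the
+# approximation computed below, for |r| <= pi/4:
+#   sin(r) ~= r + r^3*(s1 + s2*r^2 + ... + s6*r^10) + rr - 0.5*r^2*rr
+# where rr is the low part of the reduced argument.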
+ movapd .Lsinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movapd .Lsinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm4,p_temp(%rsp) # store xx
+ movapd %xmm2,%xmm1 # move for x4
+ mulpd %xmm2,%xmm1 # x4
+ movapd %xmm0,p_temp1(%rsp) # store x
+
+ mulpd %xmm2,%xmm5 # x2s3
+ movapd %xmm0,%xmm4 # move for x3
+ addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm1 # x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm4 # x3
+ addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+
+ movapd p_temp(%rsp),%xmm0 # load xx
+ mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm0,%xmm2 # 0.5 * x2 *xx
+ addpd %xmm5,%xmm3 # zs
+ mulpd %xmm3,%xmm4 # *x3
+ subpd %xmm2,%xmm4 # x3*zs - 0.5 * x2 *xx
+ addpd %xmm4,%xmm0 # +xx
+ addpd p_temp1(%rsp),%xmm0 # +x
+
+ xorpd %xmm6,%xmm0 # xor sign
+ jmp .L__vrd2_cos_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm4,%xmm4
+
+# Work on Upper arg
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+# The lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+.align 16
+0:
+#If upper Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+	cvttsd2si	%xmm2,%ecx		# ecx = npi2 trunc to int
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
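+#__amd_remainder_piby2 appears to take the argument in xmm0 and pointers to
+#r, rr and region in rdi, rsi and rdx (see the lea/movlpd setup below); it
+#performs the full-precision pi/2 reduction needed for these large arguments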
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf:
+	mov	p_original(%rsp),%rax		# lower arg is nan/inf
+
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd2_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+	movlhps	%xmm0,%xmm0			#Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+ movlhps %xmm2,%xmm2
+ movlhps %xmm4,%xmm4
+
+
+# Work on Lower arg
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+# The upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax			# is lower arg > piby4
+ ja 0f
+
+ mov $0,%eax # region = 0
+	mov	%eax,region(%rsp)		# store lower region
+	movlpd	%xmm0,r(%rsp)			# store lower r
+	xorpd	%xmm4,%xmm4			# rr = 0
+	movlpd	%xmm4,rr(%rsp)			# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If lower Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+	cvttsd2si	%xmm2,%eax		# eax = npi2 trunc to int
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm4,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd2_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+# movhlps %xmm0, %xmm6 #Save upper fp arg for remainder_piby2 call
+ movhpd %xmm0, p_temp1(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp1(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_cos_reconstruct:
+#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+
+ movapd rr(%rsp),%xmm4 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for sincos path
+ mov %r8,%r10
+ and .L__reald_one_one(%rip),%r8 #odd/even region for sin/cos
+ add .L__reald_one_one(%rip),%r10
+ and .L__reald_two_two(%rip),%r10
+ mov %r10,%r11
+ and .L__reald_two_zero(%rip),%r11 #mask out the lower sign bit leaving the upper sign bit
+	shl	$62,%r10			#move the lower sign bit (bit 1) up to bit 63
+	shl	$30,%r11			#move the upper sign bit (bit 33) up to bit 63
+ mov %r10,p_temp(%rsp) #write out lower sign bit
+ mov %r11,p_temp+8(%rsp) #write out upper sign bit
+ movapd p_temp(%rsp),%xmm6 #write out both sign bits to xmm6
+
+ jmp .L__vrd2_cos_approximate
+
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_cos_cleanup:
+ add $0x138,%rsp
+ ret
diff --git a/src/gas/vrd2exp.S b/src/gas/vrd2exp.S
new file mode 100644
index 0000000..b87763f
--- /dev/null
+++ b/src/gas/vrd2exp.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2exp.S
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_exp(__m128d x);
+#
+# Computes e raised to the x power.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ stack_size,0x28
+
+
+
+
+.globl __vrd2_exp
+ .type __vrd2_exp,@function
+__vrd2_exp:
+ sub $stack_size,%rsp
+
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm0,p_temp(%rsp)
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+ movapd %xmm0,%xmm2
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ addpd %xmm1,%xmm2 #r = r1 + r2
+
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ addpd %xmm3,%xmm0 # q = final sum
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ shufpd $0,%xmm4,%xmm5
+
+
+ mulpd %xmm5,%xmm0
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ addpd %xmm5,%xmm0 #z = z1 + z2
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+#;;; the following code moved to improve scheduling
+# deal with infinite results
+# mov $1024,%rax
+# movsxd %ecx,%rcx
+# cmp %rax,%rcx
+# cmovg %rax,%rcx ; if infinite, then set rcx to multiply
+ # by infinity
+# movsxd %edx,%rdx
+# cmp %rax,%rdx
+# cmovg %rax,%rdx ; if infinite, then set rcx to multiply
+ # by infinity
+
+# deal with denormal results
+# xor %rax,%rax
+# add $1023,%rcx ; add bias
+# shl $52,%rcx ; build 2^n
+
+# add $1023,%rdx ; add bias
+ shl $52,%rdx # build 2^n
+
+# check for infinity or nan
+# movapd p_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ movmskpd %xmm2,%r8d
+ test $3,%r8d
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result*= 2^n
+
+# We would like to avoid a branch, and could use cmp's and and's to
+# eliminate it, but that adds cycles to the normal path just to handle
+# cases that should be exceptional.  Using this branch together with the
+# check above gives faster code for the normal cases.
+ jnz .L__exp_naninf
+
+#
+#
+.L__final_check:
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov p_temp(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__final_check
+ mov p_temp+8(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+ jmp .L__final_check
+
+ .data
+ .align 16
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000:	.quad 0x040F0000000000000	# 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000 # for alignment
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000 # for alignment
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd2log.S b/src/gas/vrd2log.S
new file mode 100644
index 0000000..30bb3b1
--- /dev/null
+++ b/src/gas/vrd2log.S
@@ -0,0 +1,573 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log.s
+#
+# An implementation of the log libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log(__m128d x);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs.
+#
+#
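+# Outline of the algorithm below (a rough sketch, not a line-by-line spec):
+#   x = 2^xexp * f with f in [0.5, 1); a 7-bit index picks f1 = j/128 near f
+#   u = 2*(f - f1)/(f + f1), so that ln(f/f1) = u + u^3/12 + u^5/80 + ...
+#   ln(x) = xexp*ln(2) + ln(f1) + ln(f/f1)
+# ln(f1) and ln(2) are stored as lead/tail pairs (tables below) to keep the
+# sum accurate; inputs close to 1.0 take a separate series in (x - 1).
+#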
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log
+ .type __vrd2_log,@function
+__vrd2_log:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm6,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm6
+
+ addpd %xmm6,%xmm1 #r2
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lall_nearone:
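+# Near-one path (sketch): with r = x - 1 and u = 2*r/(2 + r),
+#   ln(x) = u + u^3*(ca_1 + u^2*(ca_2 + u^2*(ca_3 + u^2*ca_4)))
+# The code returns r + (poly - correction), where correction = r*u/2 equals
+# r - u exactly, so the leading term stays accurate when r is tiny.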
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz	.L__zn_x			## bits remain after shifting out the sign, so x is a nonzero negative number
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2log10.S b/src/gas/vrd2log10.S
new file mode 100644
index 0000000..46cb2ad
--- /dev/null
+++ b/src/gas/vrd2log10.S
@@ -0,0 +1,628 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log10.s
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log10(__m128d x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 120-130 cycles for valid inputs.
+#
+#
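+# The exponent/index reduction and the ln(1+u) evaluation below are the same
+# as in vrd2log.S; the natural-log result is then scaled by log10(e), which
+# is split into lead and tail parts so the conversion preserves the accuracy
+# of the intermediate.
+#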
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log10
+ .type __vrd2_log10,@function
+__vrd2_log10:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log10 tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm6,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm6
+
+ addpd %xmm6,%xmm1 #r2
+
+# loge to log10
+ movapd %xmm1,%xmm3
+ mulpd .L__real_log10e_tail(%rip),%xmm1
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ addpd %xmm1,%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm3,%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm2
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0
+
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lall_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
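+# Sketch of .L__lni above in C (illustrative only):
+#   if (mantissa_bits(x) != 0) return quieted(x);   /* NaN in -> quiet NaN out */
+#   if (!signbit(x))           return x;            /* log10(+inf) = +inf      */
+#   return NaN;                                     /* -inf: invalid, NaN      */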
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# bits remain after shifting out the sign, so the input is a negative non-zero value

+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
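+# Sketch of .L__zni above in C (illustrative only); this path is reached only
+# when the input compared <= 0.0:
+#   if (x == 0.0) return -inf;   /* C99: log10(+-0) = -inf */
+#   return NaN;                  /* x < 0: invalid          */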
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+ .align 16
+
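+# The two tables below satisfy (to within the tail precision, roughly)
+#   lead[j] + tail[j] ~= ln(1 + j/64),   j = 0..64,
+# where lead[j] keeps only the upper mantissa bits so that multiplying it by
+# the similarly split log constants loses little precision; the last entry
+# (j = 64) is ln 2.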
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2log2.S b/src/gas/vrd2log2.S
new file mode 100644
index 0000000..92fe290
--- /dev/null
+++ b/src/gas/vrd2log2.S
@@ -0,0 +1,621 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log2.s
+#
+# An implementation of the vector log2 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log2(__m128d x);
+#
+# Computes the log2 of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log2
+ .type __vrd2_log2,@function
+__vrd2_log2:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 # z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm1
+ movapd %xmm1,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm1 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
+
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0 #r1+r2
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
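+# Overall flow of the main (non-special-case) path above, as a C sketch
+# (names illustrative; this restates the inline comments, nothing new):
+#   /* x = 2^xexp * m, 1 <= m < 2; index (64..128) approximates 128*(m/2) */
+#   f1 = index / 128.0;   f2 = f - f1;        /* f = m/2, in [0.5, 1) */
+#   u  = f2 / (f1 + 0.5*f2);
+#   z1 = ln_lead_table[index-64];             /* ~ ln(index/64)       */
+#   z2 = ln_tail_table[index-64] + (u + u*u*u*(cb1 + u*u*(cb2 + u*u*cb3)));
+#   r1 = z1*log2e_lead + xexp;
+#   r2 = z1*log2e_tail + z2*log2e_tail + z2*log2e_lead;
+#   log2(x) ~= r1 + r2;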
+
+ .align 16
+.Lall_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+ movapd .L__real_log2e_tail(%rip),%xmm4
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ movapd .L__real_log2e_lead(%rip),%xmm5
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd %xmm4,%xmm2
+ mulpd %xmm4,%xmm0
+ mulpd %xmm5,%xmm1
+ mulpd %xmm5,%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+ movsd .L__real_log2e_tail(%rip),%xmm4
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ movsd .L__real_log2e_lead(%rip),%xmm5
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd %xmm4,%xmm2
+ mulsd %xmm4,%xmm0
+ mulsd %xmm5,%xmm1
+ mulsd %xmm5,%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# bits remain after shifting out the sign, so the input is a negative non-zero value
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail:	.quad 0x03ECB295C17F0BBBE # log2e_tail  3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2sin.S b/src/gas/vrd2sin.S
new file mode 100644
index 0000000..50c0deb
--- /dev/null
+++ b/src/gas/vrd2sin.S
@@ -0,0 +1,805 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm sin function.
+#
+# Prototype:
+#
+# __m128d __vrd2_sin(__m128d x);
+#
+# Computes Sine of x
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one
+ .quad 0x0ffffffffffffffff
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ p_temp2,0x20 # temporary for get/put bits operation
+.equ p_xmm6, 0x30 # temporary for get/put bits operation
+.equ p_xmm7, 0x40 # temporary for get/put bits operation
+.equ p_xmm8, 0x50 # temporary for get/put bits operation
+.equ p_xmm9, 0x60 # temporary for get/put bits operation
+.equ p_xmm10,0x70 # temporary for get/put bits operation
+.equ p_xmm11,0x80 # temporary for get/put bits operation
+.equ p_xmm12,0x90 # temporary for get/put bits operation
+.equ p_xmm13,0x0A0 # temporary for get/put bits operation
+.equ p_xmm14,0x0B0 # temporary for get/put bits operation
+.equ p_xmm15,0x0C0 # temporary for get/put bits operation
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ rr, 0x0E0 # pointer to r for remainder_piby2
+.equ region, 0x0F0 # pointer to r for remainder_piby2
+.equ p_original,0x100 # original x
+.equ	p_mask,	0x110					# mask
+.equ	p_sign,	0x120					# sign
+
+.globl __vrd2_sin
+ .type __vrd2_sin,@function
+__vrd2_sin:
+
+ sub $0x138,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+movdqa	%xmm0,%xmm6					#keep a signed copy of x for the sign computation **
+andpd .L__real_7fffffffffffffff(%rip), %xmm0 #Unsign -
+
+movd %xmm0,%rax #rax is lower arg +
+movhpd %xmm0, p_temp+8(%rsp) # +
+mov p_temp+8(%rsp),%rcx #rcx = upper arg +
+movdqa %xmm0,%xmm1
+
+ #This will mask all nan/infs also
+pcmpgtd %xmm6,%xmm1
+movdqa %xmm1,%xmm6
+psrldq $4, %xmm1
+psrldq $8, %xmm6
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip), %xmm5 #0.5 for later use +
+
+por %xmm1,%xmm6
+movd %xmm6,%r11 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x +
+movapd %xmm0,%xmm4 #x +
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ movapd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1
+ addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2
+ movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail
+ cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints
+ cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double.
+
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm3 # npi2 * piby2_1
+ subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1
+
+#t = rhead;
+ movapd %xmm4,%xmm5 # xmm5=t=rhead
+
+#rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2
+
+#rhead = t - rtail;
+ subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subpd %xmm4,%xmm5 # t-rhead
+ subpd %xmm5,%xmm1 # rtail-(t - rhead)
+ addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead))
+
+#r = rhead - rtail
+#rr=(rhead-r) -rtail
+#Sign
+#Region
+ movdqa %xmm0,%xmm5 # Region +
+ movd %xmm0,%r10 # Sign
+ movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype) +
+
+ subpd %xmm1,%xmm0 # rhead - rtail +
+ pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Cos/Sin +
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin +
+ subpd %xmm0,%xmm4 # rr=rhead-r +
+ movd %xmm5,%r8 # Region +
+ movapd %xmm0,%xmm2 # Move for x2 +
+ mulpd %xmm0,%xmm2 # x2 +
+ subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail +
+
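+# Reduction summary (C sketch; restates the inline comments above):
+#   npi2  = (int)(x*twobypi + 0.5);                /* truncation; x here is |x| */
+#   rhead = x - npi2*piby2_1;      t = rhead;
+#   rhead = t - npi2*piby2_2;
+#   rtail = npi2*piby2_2tail - ((t - rhead) - npi2*piby2_2);
+#   r  = rhead - rtail;            rr = (rhead - r) - rtail;
+#   /* npi2 & 1 selects the sin or cos series per lane; bit 1 of npi2 feeds the sign */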
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
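+# The bit manipulation above computes, per element,
+#   sign = signbit(x) XOR (bit 1 of npi2)
+# (the ~AB + A~B form is just XOR), since sin(|x| + n*pi/2) flips sign for
+# n mod 4 in {2,3} and sin(-x) = -sin(x); the two sign bits are positioned
+# and stored at p_sign for the final xorpd.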
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+
+.align 16
+.L__vrd2_sin_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_sin_piby4
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lvrd2_sin_piby4:
+ movapd .Lsinarray+0x50(%rip),%xmm3 # s6
+ movapd .Lsinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move for x4
+
+ mulpd %xmm2,%xmm3 # x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ mulpd %xmm2,%xmm1 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm6 # move for x3
+ addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3
+
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm1 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+
+ mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6))
+ mulpd %xmm0,%xmm6 # x3
+ addpd %xmm5,%xmm3 # zs
+ mulpd %xmm4,%xmm2 # 0.5 * x2 *xx
+
+ mulpd %xmm3,%xmm6 # x3*zs
+ subpd %xmm2,%xmm6 # x3*zs - 0.5 * x2 *xx
+ addpd %xmm4,%xmm6 # +xx
+ addpd %xmm6,%xmm0 # +x
+ xorpd p_sign(%rsp),%xmm0 # xor sign
+ jmp .L__vrd2_sin_cleanup
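+# C sketch of the .Lvrd2_sin_piby4 kernel above (|r| <= pi/4, rr = tail of the
+# reduced argument; restates the inline comments):
+#   x2  = r*r;
+#   zs  = s1 + x2*(s2 + x2*(s3 + x2*(s4 + x2*(s5 + x2*s6))));
+#   sin = r + (r*x2*zs - 0.5*x2*rr + rr);
+# followed by xorpd with the precomputed sign word.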
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_not_sin_cos_piby4
+
+.Lvrd2_sin_cos_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # rr move to to memory
+ movapd %xmm0,p_temp1(%rsp) # r move to to memory
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ movhlps %xmm0,%xmm0 # high of x for x3
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+
+ movhlps %xmm2,%xmm4 # high of x2 for x3
+ addpd %xmm5,%xmm3 # z
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+ mulsd %xmm0,%xmm4 # x3 #
+ movhlps %xmm3,%xmm5 # xmm5 = sin
+ # xmm3 = cos
+
+ mulsd %xmm4,%xmm5 # sin*x3 #
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 #
+ mulsd %xmm1,%xmm3 # cos*x4 #
+
+ subsd %xmm2,%xmm4 # t=1.0-r #
+
+ movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx #
+ mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx #
+ subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx #
+ addsd p_temp+8(%rsp),%xmm5 # sin+xx #
+
+ movlpd p_temp1(%rsp),%xmm6 # x
+ mulsd p_temp(%rsp),%xmm6 # x *xx #
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 #
+ subsd %xmm4,%xmm1 # 1 -t #
+ addsd %xmm5,%xmm0 # sin+x #
+ subsd %xmm2,%xmm1 # (1-t) - r #
+ subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx #
+ addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx #
+ addsd %xmm4,%xmm3 # cos+t #
+
+ movapd p_sign(%rsp),%xmm2 # load sign
+ movlhps %xmm0,%xmm3
+ movapd %xmm3,%xmm0
+ xorpd %xmm2,%xmm0
+ jmp .L__vrd2_sin_cleanup
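+# The mixed blocks (.Lvrd2_sin_cos_piby4 above, .Lvrd2_cos_sin_piby4 below)
+# cover the case where the two lanes land in different octant classes: one
+# element goes through the sin kernel and the other through the cos kernel,
+# and the scalar results are recombined with movlhps before the sign xor.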
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_cos_piby4:
+ cmp %r9,%r8
+ jnz .Lvrd2_cos_piby4
+
+.Lvrd2_cos_sin_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # Store rr
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype)
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move x2 for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6)
+
+ movhlps %xmm1,%xmm1 # move high x4 for cos
+ mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6))
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ movapd %xmm2,%xmm4 # move low x2 for x3
+ mulsd %xmm0,%xmm4 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+
+ addpd %xmm3,%xmm5 # z
+ movhlps %xmm2,%xmm6 # move high r for cos
+ movhlps %xmm5,%xmm3 # xmm5 = sin
+ # xmm3 = cos
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx
+
+ mulsd %xmm4,%xmm5 # sin *x3
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0
+ mulsd %xmm1,%xmm3 # cos *x4
+ subsd %xmm6,%xmm4 # t=1.0-r
+
+ movhlps %xmm0,%xmm1
+ subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx
+
+ mulsd p_temp+8(%rsp),%xmm1 # x * xx
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1
+ subsd %xmm4,%xmm2 # 1 - t
+ addsd p_temp(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm5,%xmm0 # sin + x
+ addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx)
+ addsd %xmm4,%xmm3 # cos+t
+
+ movapd p_sign(%rsp),%xmm5 # load sign
+ movlhps %xmm3,%xmm0
+ xorpd %xmm5,%xmm0
+ jmp .L__vrd2_sin_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+
+.Lvrd2_cos_piby4:
+ mulpd %xmm0,%xmm4 # x*xx
+ movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype)
+ movapd .Lcosarray+0x50(%rip),%xmm1 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm0 # c3
+ mulpd %xmm2,%xmm5 # r = 0.5 *x2
+ movapd %xmm2,%xmm3 # copy of x2 for x4
+ movapd %xmm4,p_temp(%rsp) # store x*xx
+ mulpd %xmm2,%xmm1 # c6*x2
+ mulpd %xmm2,%xmm0 # c3*x2
+ subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0
+ mulpd %xmm2,%xmm3 # x4
+ addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3
+ addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t)
+ mulpd %xmm2,%xmm3 # x6
+ mulpd %xmm2,%xmm1 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm0 # x2(c2+x2C3)
+ movapd %xmm2,%xmm4 # copy of x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2
+ addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3)
+ mulpd %xmm2,%xmm2 # x4
+ subpd %xmm4,%xmm5 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6))
+ addpd %xmm1,%xmm0 # zc
+ subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0
+ subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx
+ mulpd %xmm2,%xmm0 # x4 * zc
+ addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subpd %xmm4,%xmm0 # result - (-t)
+ xorpd p_sign(%rsp),%xmm0 # xor with sign
+ jmp .L__vrd2_sin_cleanup
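+# C sketch of the .Lvrd2_cos_piby4 kernel above (restates the inline comments):
+#   x2 = r*r;    h = 0.5*x2;
+#   t  = 1.0 - h;              e = (1.0 - t) - h;    /* rounding error of t */
+#   zc = c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*c6))));
+#   cos = t + (x2*x2*zc + (e - r*rr));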
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm4,%xmm4
+
+# Work on Upper arg
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+# The lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+	movlpd	 %xmm0,r+8(%rsp)			# store upper r (unsigned; the sign is applied later)
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+#If upper Arg is > piby4
+.align 16
+0:
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%ecx # xmm0 = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+ #/* Subtract the multiple from x to get an extra-precision remainder */
+ #rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+ #t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+ #rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+ #rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+ #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+ #r = rhead - rtail
+ #rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+
+#If lower Arg is > 5e5
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r9 # is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_sin_reconstruct
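+# For huge arguments (|x| >= 5e5) the 3-piece pi/2 split is no longer accurate
+# enough, so the scalar helper __amd_remainder_piby2 does the full reduction.
+# From the register setup at the call site its prototype is presumably along
+# the lines of
+#   void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+# (inferred here from the argument registers, not stated in this file).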
+
+.L__vrd2_cos_lower_naninf:
+ mov r(%rsp),%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+ jmp .L__vrd2_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+	movlhps	%xmm0,%xmm0				#Not strictly needed since we only work on the lower arg, but done to be safe, to avoid exceptions from nan/inf, and to mirror the lower_arg_gt_5e5 case
+ movlhps %xmm2,%xmm2
+ movlhps %xmm4,%xmm4
+
+# Work on Lower arg
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+# The upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax					# is lower arg > piby4
+	ja	0f
+
+	mov 	$0,%eax						# region = 0
+	mov	%eax,region(%rsp)				# store lower region
+	movlpd	%xmm0,r(%rsp)					# store lower r
+	xorpd	%xmm4,%xmm4					# rr = 0
+	movlpd	%xmm4,rr(%rsp)					# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If upper Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%eax # xmm0 = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm4,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r9 # is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_sin_reconstruct
+
+.L__vrd2_cos_upper_naninf:
+ mov r+8(%rsp),%rcx # upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ jmp .L__vrd2_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+ movhpd %xmm0,p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r9 #is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r11,p_temp1(%rsp) #Save Sign
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp1(%rsp),%r11 #Restore Sign
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ movd %xmm0,%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r9 #is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp2(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_temp2(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_sin_reconstruct:
+#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+ movapd rr(%rsp),%xmm4 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path
+ mov %r8,%r10
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
+
+ jmp .L__vrd2_sin_approximate
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_sin_cleanup:
+ add $0x138,%rsp
+ ret
+
diff --git a/src/gas/vrd2sincos.S b/src/gas/vrd2sincos.S
new file mode 100644
index 0000000..b25bb37
--- /dev/null
+++ b/src/gas/vrd2sincos.S
@@ -0,0 +1,968 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm sincos function.
+#
+# Prototype:
+#
+# __vrd2_sincos(__m128d x, __m128d* ys, __m128d* yc);
+#
+# Computes Sine and Cosine of x.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one
+ .quad 0x0ffffffffffffffff
+.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff #
+ .quad 0x000000000ffffffff #
+.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 #
+ .quad 0x0ffffffff00000000 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+
+.text
+.align 16
+.p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+.equ p_temp2, 0x20 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x30 # temporary for get/put bits operation
+.equ save_xmm7, 0x40 # temporary for get/put bits operation
+.equ save_xmm8, 0x50 # temporary for get/put bits operation
+.equ save_xmm9, 0x60 # temporary for get/put bits operation
+.equ save_xmm10, 0x70 # temporary for get/put bits operation
+.equ save_xmm11, 0x80 # temporary for get/put bits operation
+.equ save_xmm12, 0x90 # temporary for get/put bits operation
+.equ save_xmm13, 0x0A0 # temporary for get/put bits operation
+.equ save_xmm14, 0x0B0 # temporary for get/put bits operation
+.equ save_xmm15, 0x0C0 # temporary for get/put bits operation
+
+.equ save_rdi, 0x0D0
+.equ save_rsi, 0x0E0
+
+.equ r, 0x0F0 # pointer to r for remainder_piby2
+.equ rr, 0x0100 # pointer to r for remainder_piby2
+.equ region, 0x0110 # pointer to r for remainder_piby2
+
+.equ p_original, 0x0120 # original x
+.equ p_mask, 0x0130 # original x
+.equ p_sign, 0x0140 # original x
+.equ p_sign1, 0x0150 # original x
+.equ p_x, 0x0160 #x
+.equ p_xx, 0x0170 #xx
+.equ p_x2, 0x0180 #x2
+.equ p_sin, 0x0190 #sin
+.equ p_cos, 0x01A0 #cos
+.equ p_temp2, 0x01B0 # temporary for get/put bits operation
+
+.globl __vrd2_sincos
+ .type __vrd2_sincos,@function
+__vrd2_sincos:
+ sub $0x1C8,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movdqa %xmm0,%xmm6 #move to mem to get into integer regs **
+movdqa %xmm0, p_original(%rsp) #move to mem to get into integer regs -
+
+andpd .L__real_7fffffffffffffff(%rip),%xmm0 #Unsign -
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movdqa %xmm0,%xmm8
+
+pcmpgtd %xmm6,%xmm8
+movdqa %xmm8,%xmm6
+psrldq $4,%xmm8
+psrldq $8,%xmm6
+
+mov $0x3FE921FB54442D18,%rdx #piby4
+mov $0x411E848000000000,%r10 #5e5
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+por %xmm6,%xmm8
+movd %xmm8,%r11 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x
+movapd %xmm0,%xmm6 #x
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r8,%rcx
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+
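The comments above spell out the extra-precision reduction of x by the nearest multiple of pi/2, using the split constants piby2_1, piby2_2 and piby2_2tail from the .data section. A scalar C sketch of the same steps (the bit patterns are the ones defined above; the helper and function names are illustrative only):

    #include <stdint.h>
    #include <string.h>

    static double from_bits(uint64_t u) { double d; memcpy(&d, &u, sizeof d); return d; }

    /* Scalar model of the |x| < 5e5 reduction performed above. */
    static void reduce_piby2(double x, double *r, double *rr, int *region) {
        const double twobypi     = from_bits(0x3fe45f306dc9c883ULL);
        const double piby2_1     = from_bits(0x3ff921fb54400000ULL);
        const double piby2_2     = from_bits(0x3dd0b4611a600000ULL);
        const double piby2_2tail = from_bits(0x3ba3198a2e037073ULL);

        int    npi2  = (int)(x * twobypi + 0.5);   /* nearest multiple of pi/2  */
        double rhead = x - npi2 * piby2_1;         /* head of the remainder     */
        double rtail = npi2 * piby2_2;
        double t     = rhead;
        rhead = t - rtail;
        rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
        *r      = rhead - rtail;                   /* extra-precision remainder */
        *rr     = (rhead - *r) - rtail;            /* low-order tail of r       */
        *region = npi2 & 3;                        /* quadrant for sin/cos/sign */
    }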
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rax
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rax
+ not %r11
+ and %r11,%rax
+ or %rax,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ mov %r10,%r11
+ and %rdx,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
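The bit manipulation above implements the ~AB + A~B noted in the comment, i.e. an XOR of the argument's sign bit (A) with bit 1 of the region (B), and parks the result in the top bit of each lane so the later xorpd against p_sign can flip the sin result. A per-lane sketch of the same logic (names are illustrative):

    #include <stdint.h>

    /* Sign mask for the sin term of one lane: flip the sign when exactly one of
       "x was negative" and "region bit 1" is set (~AB + A~B == A xor B). */
    static uint64_t sin_sign_mask(int x_is_negative, unsigned region) {
        unsigned flip = ((unsigned)x_is_negative ^ (region >> 1)) & 1u;
        return (uint64_t)flip << 63;          /* sign bit of an IEEE double */
    }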
+
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ movapd %xmm0,%xmm2 #move r for r2
+ mulpd %xmm0,%xmm2 #r2
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin +
+
+
+ add .L__reald_one_one(%rip),%rcx
+ and .L__reald_two_two(%rip),%rcx
+ shr $1,%rcx
+
+ mov %rcx,%rdx
+ and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%rcx #shift lower sign bit left by 63 bits
+ shl $31,%rdx #shift upper sign bit left by 31 bits
+ mov %rcx,p_sign1(%rsp) #write out lower sign bit
+ mov %rdx,p_sign1+8(%rsp) #write out upper sign bit
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.L__vrd2_sincos_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_sin_piby4
+
+.Lvrd2_sin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm4 # x4 * zc
+ mulpd %xmm2,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm5 # sin + xx
+ subpd p_temp1(%rsp),%xmm4 # cos - (-t)
+ addpd %xmm0,%xmm5 # sin + x
+
+ jmp .L__vrd2_sincos_cleanup
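The two polynomial chains above evaluate the sin and cos cores for a reduced argument with |r| <= pi/4, keeping the extra-precision tail xx in play through the 0.5*x2*xx and x*xx corrections. A scalar sketch of what one lane computes, using the six coefficients from .Lsinarray and .Lcosarray (a model of the math, not the exact instruction order):

    /* Scalar model of the piby4 core above; s[] and c[] are the minimax
       coefficients s1..s6 and c1..c6 from .Lsinarray and .Lcosarray. */
    static void sincos_piby4(double x, double xx,
                             const double s[6], const double c[6],
                             double *sn, double *cs) {
        double x2 = x * x, x3 = x2 * x, x4 = x2 * x2, x6 = x4 * x2;

        double zs = (s[0] + x2 * (s[1] + x2 * s[2]))
                  + x6 * (s[3] + x2 * (s[4] + x2 * s[5]));
        double zc = (c[0] + x2 * (c[1] + x2 * c[2]))
                  + x6 * (c[3] + x2 * (c[4] + x2 * c[5]));

        double r = 0.5 * x2;
        double t = 1.0 - r;                                   /* head of cos */

        *sn = x + (xx + (x3 * zs - 0.5 * x2 * xx));           /* sin(x + xx) */
        *cs = t + ((((1.0 - t) - r) - x * xx) + x4 * zc);     /* cos(x + xx) */
    }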
+
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp .L__reald_one_one(%rip),%r8
+ jnz .Lvrd2_not_cos_piby4
+
+.Lvrd2_cos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm9,%xmm5 # zc
+ addpd %xmm8,%xmm4 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm5 # x4 * zc
+ mulpd %xmm2,%xmm4 # x3 * zs
+
+ addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # sin + xx
+ subpd p_temp1(%rsp),%xmm5 # cos - (-t)
+ addpd %xmm0,%xmm4 # sin + x
+
+ jmp .L__vrd2_sincos_cleanup
+
+.align 16
+.Lvrd2_not_cos_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_cossin_piby4
+
+.Lvrd2_sincos_piby4:
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm4 # x4 * zc
+ mulpd %xmm2,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm5 # sin + xx
+ subpd p_temp1(%rsp),%xmm4 # cos - (-t)
+ addpd %xmm0,%xmm5 # sin + x
+
+ movsd %xmm4,%xmm1
+ movsd %xmm5,%xmm4
+ movsd %xmm1,%xmm5
+
+ jmp .L__vrd2_sincos_cleanup
+
+.align 16
+.Lvrd2_cossin_piby4:
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm9,%xmm5 # zc
+ addpd %xmm8,%xmm4 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm5 # x4 * zc
+ mulpd %xmm2,%xmm4 # x3 * zs
+
+ addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # sin + xx
+ subpd p_temp1(%rsp),%xmm5 # cos - (-t)
+ addpd %xmm0,%xmm4 # sin + x
+
+ movsd %xmm5,%xmm1
+ movsd %xmm4,%xmm5
+ movsd %xmm1,%xmm4
+
+ jmp .L__vrd2_sincos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+ movlpd %xmm0,r+8(%rsp) # store upper r (unsigned - sign is adjusted later based on sign)
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+#If upper Arg is > piby4
+.align 16
+0:
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttsd2si %xmm2,%ecx # npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2
+ cvtsi2sd %ecx,%xmm2 # npi2 trunc to doubles
+
+ #/* Subtract the multiple from x to get an extra-precision remainder */
+ #rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail
+
+ #t = rhead;
+ movsd %xmm6,%xmm5 # t = rhead
+
+ #rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2)
+
+ #rhead = t - rtail
+ subsd %xmm1,%xmm6 # rhead=(t-rtail)
+
+ #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm8 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+ #r = rhead - rtail
+ #rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm1,%xmm0 # r=(rhead-rtail)
+
+ subsd %xmm0,%xmm6 # rr=rhead-r
+	subsd	%xmm1,%xmm6				# xmm6 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#If lower Arg is > 5e5
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r9 # is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_cos_reconstruct
+
+.L__vrd2_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign
+
+ jmp .L__vrd2_cos_reconstruct
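The handler above follows the usual vector-math convention for NaN/Inf inputs: OR the quiet bit into the argument, propagate it as r, and zero rr and the region. A sketch of the bit operation (0x0008000000000000 is the top mantissa bit of a double, so a signalling NaN becomes quiet and an infinity becomes a NaN):

    #include <stdint.h>
    #include <string.h>

    /* Model of the r = x | 0x0008000000000000 stores used in the naninf paths. */
    static double quiet_naninf(double x) {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        u |= 0x0008000000000000ULL;   /* set the top mantissa (quiet) bit */
        memcpy(&x, &u, sizeof x);
        return x;
    }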
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax				# is lower arg > piby4
+	ja	0f
+
+	mov	$0,%eax					# region = 0
+	mov	%eax,region(%rsp)			# store lower region
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	xorpd	%xmm4,%xmm4				# rr = 0
+	movlpd	%xmm4,rr(%rsp)				# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If lower Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttsd2si %xmm2,%eax # npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2
+ cvtsi2sd %eax,%xmm2 # npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1;
+ subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm6 # rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm8 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm1,%xmm0 # r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm1,%xmm6 # rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r9 # is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_cos_reconstruct
+
+.L__vrd2_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign
+ jmp .L__vrd2_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+ movhpd %xmm0, p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r9 #is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r11,p_temp1(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r11 #Restore Sign
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r9 #is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp2(%rsp), %xmm0 #Restore upper fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_cos_reconstruct:
+#Construct p_sign=Sign for Sin term, p_sign1=Sign for Cos term, xmm0 = r, xmm2 = r2, xmm6 = rr, r8=region
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+ movapd rr(%rsp),%xmm6 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path
+ mov %r8,%r10
+ mov %r8,%rax
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
+
+ add .L__reald_one_one(%rip),%rax
+ and .L__reald_two_two(%rip),%rax
+ shr $1,%rax
+
+ mov %rax,%rdx
+ and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%rax #shift lower sign bit left by 63 bits
+ shl $31,%rdx #shift upper sign bit left by 31 bits
+ mov %rax,p_sign1(%rsp) #write out lower sign bit
+ mov %rdx,p_sign1+8(%rsp) #write out upper sign bit
+
+
+ jmp .L__vrd2_sincos_approximate
+
+
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_sincos_cleanup:
+
+ xorpd p_sign(%rsp),%xmm5 # SIN sign
+ xorpd p_sign1(%rsp),%xmm4 # COS sign
+
+ mov p_sin(%rsp),%rdi
+ mov p_cos(%rsp),%rsi
+
+ movapd %xmm5,(%rdi) # save the sin
+ movapd %xmm4,(%rsi) # save the cos
+
+.Lfinal_check:
+ add $0x1C8,%rsp
+ ret
+
diff --git a/src/gas/vrd4cos.S b/src/gas/vrd4cos.S
new file mode 100644
index 0000000..5ecc97c
--- /dev/null
+++ b/src/gas/vrd4cos.S
@@ -0,0 +1,2987 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4cos.s
+#
+# A vector implementation of the cos libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_cos(__m128d x1, __m128d x2);
+#
+# Computes Cosine of x for four input values at a time.
+# Results are returned in registers (see below), not stored to a y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 double precision Cosine values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# ( and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
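Because the two-register return described above cannot be expressed as a standard C prototype, the routine is effectively a compiler/intrinsic-level interface; its per-lane semantics are simply those of cos. A scalar reference sketch of what the four packed lanes compute (an illustration, not the library's code path):

    #include <math.h>

    /* Reference model: {x0,x1} arrive in xmm0 and {x2,x3} in xmm1; the results
       {cos(x0),cos(x1)} and {cos(x2),cos(x3)} come back in xmm0 and xmm1. */
    static void ref_vrd4_cos(const double x[4], double y[4]) {
        for (int i = 0; i < 4; i++)
            y[i] = cos(x[i]);   /* no error checking; denormal inputs may differ */
    }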
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.align 16
+.Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 *
+ .quad .Lcoscos_cossin_piby4 # 1 +
+ .quad .Lcoscos_sincos_piby4 # 2
+ .quad .Lcoscos_sinsin_piby4 # 3 +
+
+ .quad .Lcossin_coscos_piby4 # 4
+ .quad .Lcossin_cossin_piby4 # 5 *
+ .quad .Lcossin_sincos_piby4 # 6
+ .quad .Lcossin_sinsin_piby4 # 7
+
+ .quad .Lsincos_coscos_piby4 # 8
+ .quad .Lsincos_cossin_piby4 # 9
+ .quad .Lsincos_sincos_piby4 # 10 *
+ .quad .Lsincos_sinsin_piby4 # 11
+
+ .quad .Lsinsin_coscos_piby4 # 12
+ .quad .Lsinsin_cossin_piby4 # 13 +
+ .quad .Lsinsin_sincos_piby4 # 14
+ .quad .Lsinsin_sinsin_piby4 # 15 *
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ p_xmm6, 0x20 # temporary for get/put bits operation
+.equ p_xmm7, 0x30 # temporary for get/put bits operation
+.equ p_xmm8, 0x40 # temporary for get/put bits operation
+.equ p_xmm9, 0x50 # temporary for get/put bits operation
+.equ p_xmm10, 0x60 # temporary for get/put bits operation
+.equ p_xmm11, 0x70 # temporary for get/put bits operation
+.equ p_xmm12, 0x80 # temporary for get/put bits operation
+.equ p_xmm13, 0x90 # temporary for get/put bits operation
+.equ p_xmm14, 0x0A0 # temporary for get/put bits operation
+.equ p_xmm15, 0x0B0 # temporary for get/put bits operation
+
+.equ r, 0x0C0 # pointer to r for remainder_piby2
+.equ rr, 0x0D0 # pointer to r for remainder_piby2
+.equ region, 0x0E0 # pointer to r for remainder_piby2
+
+.equ r1, 0x0F0 # pointer to r for remainder_piby2
+.equ rr1, 0x0100 # pointer to r for remainder_piby2
+.equ region1, 0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # original x
+.equ p_sign, 0x0180 # original x
+
+.equ p_original1, 0x0190 # original x
+.equ p_mask1, 0x01A0 # original x
+.equ p_sign1, 0x01B0 # original x
+
+.globl __vrd4_cos
+ .type __vrd4_cos,@function
+__vrd4_cos:
+ sub $0x1C8,%rsp
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+movdqa %xmm0, p_original(%rsp)
+movdqa %xmm1, p_original1(%rsp)
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8 #rax is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #rcx = upper arg
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%rax # Region
+ movd %xmm5,%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+# paddd .L__reald_one_one(%rip),%xmm4 ; Sign
+# paddd .L__reald_one_one(%rip),%xmm5 ; Sign
+# pand .L__reald_two_two(%rip),%xmm4
+# pand .L__reald_two_two(%rip),%xmm5
+# punpckldq %xmm4,%xmm4
+# punpckldq %xmm5,%xmm5
+# psllq $62,%xmm4
+# psllq $62,%xmm5
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
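The integer gymnastics above collapse each lane's region into one parity bit and pack the four bits into an index for .Levencos_oddsin_tbl, so lanes in an even region use the cos core and lanes in an odd region use the sin core. A sketch of the index computation (as traced from the shifts and ORs above):

    /* Index into .Levencos_oddsin_tbl: bit i is (region of lane i) & 1,
       selecting the cos core (0) or the sin core (1) for that lane. */
    static unsigned evencos_oddsin_index(const unsigned region[4]) {
        return  (region[0] & 1u)
              | ((region[1] & 1u) << 1)
              | ((region[2] & 1u) << 2)
              | ((region[3] & 1u) << 3);
    }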
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12,
+# xmm9, xmm11, xmm13
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12,
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+	movsd	%xmm6,%xmm0
+	subsd	%xmm10,%xmm0				# xmm0 = r=(rhead-rtail)
+	subsd	%xmm0,%xmm6				# rr=rhead-r
+	subsd	%xmm10,%xmm6				# xmm6 = rr=((rhead-r) -rtail)
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	movlpd	%xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x; xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13,
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 # 5e5 as IEEE-754 double bits
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_higher:
+ mov p_original1(%rsp),%r8 # lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
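+
+# In C terms, the nan/inf path above reduces to the following sketch
+# (p_original1 holds the untouched input bit pattern, 0x0008000000000000 is the
+# quiet-NaN payload bit, and "as_double" stands for a bit-for-bit
+# uint64 -> double copy):
+#
+#   uint64_t bits = original_bits | 0x0008000000000000ULL;  /* quiet a signalling NaN */
+#   r      = as_double(bits);                               /* result lane = input    */
+#   rr     = 0.0;
+#   region = 0;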
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd rr(%rsp),%xmm4
+# movapd rr1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movd %r8,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# mov QWORD PTR r1[rsp+8], r9
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 ;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 # upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# movapd region(%rsp),%xmm4
+# movapd region1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%rax
+ mov region1(%rsp),%rcx
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+
+#DEBUG
+# movd %rax,%xmm4
+# movd %rax,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
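+
+# Per lane, the bit manipulation above implements (C sketch, with region being
+# the quadrant produced by the reduction step):
+#
+#   /* cos(x) = { +cos(r), -sin(r), -cos(r), +sin(r) } for region mod 4 = 0..3 */
+#   int flip_sign = ((region + 1) & 2) != 0;  /* quadrants 1 and 2 negate       */
+#   int use_sin   =  (region & 1);            /* odd quadrants use the sin core */
+#
+# flip_sign becomes the 0x8000000000000000 masks stored in p_sign/p_sign1, and
+# the four use_sin bits are packed into the 0..15 index used to select an entry
+# of .Levencos_oddsin_tbl.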
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_cleanup:
+
+ movapd p_sign(%rsp),%xmm0
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ add $0x1C8,%rsp
+ ret
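+
+# The two xorpd instructions above apply the signs prepared in the reconstruct
+# step: each result lane is XORed with either 0 or the sign-bit mask
+# 0x8000000000000000, negating the lanes whose quadrant was 1 or 2.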
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
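+
+# One lane of the cos evaluation above, written as a C sketch (c1..c6 are the
+# .Lcosarray coefficients; x and xx are the r/rr pair from the reduction):
+#
+#   double x2 = x * x;
+#   double zc = (c1 + x2*(c2 + x2*c3))
+#             + (x2*x2*x2)*(c4 + x2*(c5 + x2*c6));
+#   double r  = 0.5 * x2;
+#   double t  = 1.0 - r;                       /* rounded; error recovered below */
+#   double cos_x = t + ((x2*x2)*zc + (((1.0 - t) - r) - x*xx));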
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
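+
+# The sin lanes of the path above follow the companion scheme (C sketch; s1..s6
+# are the sin coefficients and x, xx the r/rr pair from the reduction):
+#
+#   double x2 = x * x;
+#   double zs = (s1 + x2*(s2 + x2*s3))
+#             + (x2*x2*x2)*(s4 + x2*(s5 + x2*s6));
+#   double sin_x = x + ((x2*x)*zs - 0.5*x2*xx + xx);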
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm3,%xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1,%xmm9 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r = 0.5*x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+	addpd	%xmm8,%xmm4				# zszc
+	addpd	%xmm9,%xmm5				# zs
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8				# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
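+# A rough scalar sketch of the sin-sin path below (names are illustrative;
+# the code evaluates the polynomial in two halves, low-order plus x6 times
+# the high-order terms, to improve scheduling):
+#   zs  = c1 + r2*(c2 + r2*(c3 + r2*(c4 + r2*(c5 + r2*c6))));
+#   sin = r + (r*r2*zs - 0.5*r2*rr + rr);
+# where r is the reduced argument, rr its tail and r2 = r*r.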
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_cos_cleanup
diff --git a/src/gas/vrd4exp.S b/src/gas/vrd4exp.S
new file mode 100644
index 0000000..a05af8b
--- /dev/null
+++ b/src/gas/vrd4exp.S
@@ -0,0 +1,502 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4exp.S
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_exp(__m128d x1, __m128d x2);
+#
+# Computes e raised to the x power for four packed double-precision values.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+# This routine computes 4 double precision exponential values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+#
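+# A rough scalar model of the steps below (names are illustrative; the
+# vector code interleaves two such computations across xmm register pairs):
+#   r  = x * (32/ln(2));            n = nearest_int(r);
+#   j  = n & 0x1f;                  m = (n - j) / 32;
+#   r  = (x - n*log2_by_32_lead) + (-n*log2_by_32_tail);   /* r1 + r2 */
+#   q  = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720;
+#   f1 = two_to_jby32_lead_table[j];  f2 = two_to_jby32_trail_table[j];
+#   exp(x) ~= 2^m * (f1 + f2 + (f1 + f2)*q);
+# with clamping of very large inputs and special handling of the infinite,
+# NaN and denormal cases further down.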
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for exponent multiply
+
+.equ save_rbx,0x020 #qword
+.equ save_rdi,0x028 #qword
+
+.equ save_rsi,0x030 #qword
+
+
+
+.equ p2_temp,0x40 # second temporary for get/put bits operation
+.equ p2_temp1,0x60 # second temporary for exponent multiply
+
+
+.equ stack_size,0x088
+
+
+# parameters are passed in by Linux as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_exp
+ .type __vrd4_exp,@function
+__vrd4_exp:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+ movapd %xmm1,%xmm6
+
+# process 4 values at a time.
+
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+
+# Step 1. Reduce the argument.
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm3,%xmm7
+ movapd %xmm0,p_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+ movapd %xmm6,p2_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm6
+ mulpd %xmm6,%xmm7
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+ minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+ cvtpd2dq %xmm7,%xmm2
+ cvtdq2pd %xmm2,%xmm8
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+ movq %xmm2,p2_temp1(%rsp)
+ movapd .L__real_log2_by_32_lead(%rip),%xmm9
+ mulpd %xmm8,%xmm9
+ subpd %xmm9,%xmm6 # r1b in xmm6
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ movapd %xmm0,%xmm2
+ addpd %xmm1,%xmm2 # r = r1 + r2
+
+ mov $0x01f,%r11
+ mov %r11,%r10
+ mov p2_temp1(%rsp),%ebx
+ and %ebx,%r11d
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ movapd %xmm6,%xmm9
+ addpd %xmm8,%xmm9 # rb = r1b + r2b
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ sub %r11d,%ebx
+ movapd %xmm9,%xmm1
+ addpd %xmm3,%xmm0 # q = final sum
+ movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+ mov p2_temp1+4(%rsp),%r8d
+ and %r8d,%r10d
+ sar $5,%ebx #m
+ mulpd %xmm9,%xmm7 # *x
+ mulpd %xmm9,%xmm3 # *x
+ mulpd %xmm9,%xmm1 # x*x
+ sub %r10d,%r8d
+ sar $5,%r8d
+# check for infinity or nan
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ add $1023,%rdx # add bias
+ shufpd $0,%xmm4,%xmm5
+ movapd %xmm1,%xmm4
+
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm0
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm0 #z = z1 + z2
+ mov $1024,%rax
+ movsx %ebx,%rbx
+ cmp %rax,%rbx
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+
+	cmovg	%rax,%rbx			## if infinite, then set rbx to multiply
+ # by infinity
+ movsx %r8d,%rdx
+ cmp %rax,%rdx
+
+ movmskpd %xmm2,%r8d
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm7 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm3 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm9,%xmm7 # *x
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply by infinity
+
+
+ xor %rax,%rax
+ add $1023,%rbx # add bias
+
+ mulpd %xmm1,%xmm3 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm7 # + 1/24
+ addpd %xmm9,%xmm3 # + x
+ mulpd %xmm4,%xmm7 # *x^4
+
+ cmovs %rax,%rbx ## if denormal, then multiply by 0
+ shl $52,%rbx # build 2^n
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result *= 2^n
+ addpd %xmm7,%xmm3 # q = final sum
+
+ movlpd (%rsi,%r11,8),%xmm5 # f2
+ movlpd (%rsi,%r10,8),%xmm4 # f2
+ addsd (%rdi,%r10,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r11,8),%xmm5 # f1 + f2
+
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shufpd $0,%xmm4,%xmm5
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm3
+ mov %rbx,p2_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p2_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm3 #z = z1 + z2
+
+ movapd p2_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ movmskpd %xmm2,%ebx
+ test $3,%r8d
+ mulpd p2_temp1(%rsp),%xmm3 # result *= 2^n
+# we'd like to avoid a branch, and could use cmp's and and's to
+# eliminate it.  But that adds cycles to the normal cases just to
+# handle inputs that are supposed to be exceptional.  Using this
+# branch together with the check above results in faster code for
+# the normal cases.
+ jnz .L__exp_naninf
+
+.L__vda_bottom1:
+# store the result _m128d
+ test $3,%ebx
+ jnz .L__exp_naninf2
+
+.L__vda_bottom2:
+
+ movapd %xmm3,%xmm1
+
+
+#
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_temp(%rsp),%rcx
+ call .L__naninf
+ jmp .L__vda_bottom1
+.L__exp_naninf2:
+ lea p2_temp(%rsp),%rcx
+ mov %ebx,%r8d
+ movapd %xmm0,%xmm4
+ movapd %xmm3,%xmm0
+ call .L__naninf
+ movapd %xmm0,%xmm3
+ movapd %xmm4,%xmm0
+ jmp .L__vda_bottom2
+
+# This subroutine checks a double pair for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# r8d - mask of errors
+# xmm0 - computed result vector
+# rcx - pointing to memory image of inputs
+# Outputs:
+# xmm0 - new result vector
+# %rax, %rdx, %xmm2 all modified.
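+# Roughly, for each element flagged in r8d (a sketch, names illustrative):
+#   if (mantissa(x) != 0)   result = quiet(x);   /* NaN  -> quiet NaN */
+#   else if (x == +inf)     result = +inf;       /* exp(+inf) = +inf  */
+#   else                    result = 0.0;        /* exp(-inf) = 0     */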
+.L__naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov (%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__r3
+ mov 8(%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+.L__r3:
+ ret
+
+ .data
+ .align 64
+
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
+
+
diff --git a/src/gas/vrd4frcpa.S b/src/gas/vrd4frcpa.S
new file mode 100644
index 0000000..3ae0b91
--- /dev/null
+++ b/src/gas/vrd4frcpa.S
@@ -0,0 +1,1181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4frcpa.S
+#
+# A vector implementation of the floating point reciprocal approximation function.
+# The goal is to be faster than a divide. This routine provides four double
+# precision results from four double precision inputs. It would not be necessary
+# if SSE defined a double precision instruction similar to the single precision
+# rcpss.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_frcpa(__m128d x1, __m128d x2);
+#
+# Computes an approximate reciprocal of x.
+# A table lookup is performed on the higher 10 bits of the mantissa
+# (not including the implicit bit).
+#
+#
+#
+# This routine computes 4 double precision frcpa values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops.
+#
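+# A rough scalar model of the lookup below (names are illustrative and the
+# exponent arithmetic is approximate; see the code for the exact masks):
+#   t      = bits(x) >> 41;                     /* sign, exponent, top 11 mantissa bits */
+#   index  = ((t + 1) >> 1) & 0x3ff;            /* rounded 10-bit table index           */
+#   e      = ((0x3ff000 - t) & 0x3ff800) << 1;  /* inverted (negated) exponent          */
+#   result = sign(x) | ((e | rcp_table[index]) << 40);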
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for get/put bits operation
+.equ p_x2,0x10 # temporary for get/put bits operation
+
+.equ stack_size,0x028
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_frcpa
+ .type __vrd4_frcpa,@function
+__vrd4_frcpa:
+ sub $stack_size,%rsp
+# 10 bit GPR method
+ xor %rax,%rax
+ movdqa .L__mask_expext(%rip),%xmm3
+ movdqa %xmm1,%xmm6
+ movdqa %xmm0,%xmm4
+ movdqa %xmm3,%xmm5
+## if 1/2 bit set, increment the index+exponent
+ psrlq $41,%xmm4
+ psrlq $41,%xmm6
+ movdqa %xmm4,%xmm2
+ paddq .L__int_one(%rip),%xmm4
+ psrlq $1,%xmm4
+ pand .L__mask_10bits(%rip),%xmm4
+# invert the exponent
+ psubq %xmm2,%xmm3
+ movdqa %xmm6,%xmm2
+ paddq .L__int_one(%rip),%xmm6
+ psrlq $1,%xmm6
+ pand .L__mask_10bits(%rip),%xmm6
+ psubq %xmm2,%xmm5
+ pand .L__mask_expext2(%rip),%xmm3
+ pand .L__mask_expext2(%rip),%xmm5
+ psllq $1,%xmm3
+# do the lookup and recombine
+ lea .L__rcp_table(%rip),%rdx
+
+ movdqa %xmm4,p_x(%rsp) # move the indexes to a memory location
+ psllq $1,%xmm5
+ mov p_x(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log
+ mov p_x+8(%rsp),%r9
+ movdqa %xmm6,p_x2(%rsp) # move the indexes to a memory location
+ movd (%rdx,%r9,4),%xmm2 # lookup
+ movd (%rdx,%r8,4),%xmm4 # lookup
+ pslldq $8,%xmm2 # shift by 8 bytes
+ por %xmm4,%xmm2
+ por %xmm2,%xmm3
+ mov p_x2(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log
+ mov p_x2+8(%rsp),%r9
+ movd (%rdx,%r9,4),%xmm2 # lookup
+ movd (%rdx,%r8,4),%xmm4 # lookup
+ pslldq $8,%xmm2 # shift by 8 bytes
+ por %xmm4,%xmm2
+ por %xmm2,%xmm5
+# shift and restore the sign
+ pand .L__mask_sign(%rip),%xmm0
+ pand .L__mask_sign(%rip),%xmm1
+ psllq $40,%xmm3
+ psllq $40,%xmm5
+ por %xmm3,%xmm0
+ por %xmm5,%xmm1
+ add $stack_size,%rsp
+ ret
+
+
+ .data
+ .align 16
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+.L__mask_sign: .quad 0x08000000000000000
+ .quad 0x08000000000000000
+
+.L__real_one: .quad 0x03ff0000000000000
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000
+ .quad 0x04000000000000000
+
+ .align 16
+
+.L__rcp_table:
+ .long 0x0000
+ .long 0x0FF8
+ .long 0x0FF0
+ .long 0x0FE8
+ .long 0x0FE0
+ .long 0x0FD8
+ .long 0x0FD0
+ .long 0x0FC8
+ .long 0x0FC0
+ .long 0x0FB8
+ .long 0x0FB1
+ .long 0x0FA9
+ .long 0x0FA1
+ .long 0x0F99
+ .long 0x0F91
+ .long 0x0F89
+ .long 0x0F82
+ .long 0x0F7A
+ .long 0x0F72
+ .long 0x0F6B
+ .long 0x0F63
+ .long 0x0F5B
+ .long 0x0F53
+ .long 0x0F4C
+ .long 0x0F44
+ .long 0x0F3D
+ .long 0x0F35
+ .long 0x0F2D
+ .long 0x0F26
+ .long 0x0F1E
+ .long 0x0F17
+ .long 0x0F0F
+ .long 0x0F08
+ .long 0x0F00
+ .long 0x0EF8
+ .long 0x0EF1
+ .long 0x0EEA
+ .long 0x0EE2
+ .long 0x0EDB
+ .long 0x0ED3
+ .long 0x0ECC
+ .long 0x0EC4
+ .long 0x0EBD
+ .long 0x0EB6
+ .long 0x0EAE
+ .long 0x0EA7
+ .long 0x0EA0
+ .long 0x0E98
+ .long 0x0E91
+ .long 0x0E8A
+ .long 0x0E82
+ .long 0x0E7B
+ .long 0x0E74
+ .long 0x0E6D
+ .long 0x0E65
+ .long 0x0E5E
+ .long 0x0E57
+ .long 0x0E50
+ .long 0x0E49
+ .long 0x0E41
+ .long 0x0E3A
+ .long 0x0E33
+ .long 0x0E2C
+ .long 0x0E25
+ .long 0x0E1E
+ .long 0x0E17
+ .long 0x0E10
+ .long 0x0E09
+ .long 0x0E02
+ .long 0x0DFB
+ .long 0x0DF4
+ .long 0x0DED
+ .long 0x0DE6
+ .long 0x0DDF
+ .long 0x0DD8
+ .long 0x0DD1
+ .long 0x0DCA
+ .long 0x0DC3
+ .long 0x0DBC
+ .long 0x0DB5
+ .long 0x0DAE
+ .long 0x0DA7
+ .long 0x0DA0
+ .long 0x0D9A
+ .long 0x0D93
+ .long 0x0D8C
+ .long 0x0D85
+ .long 0x0D7E
+ .long 0x0D77
+ .long 0x0D71
+ .long 0x0D6A
+ .long 0x0D63
+ .long 0x0D5C
+ .long 0x0D56
+ .long 0x0D4F
+ .long 0x0D48
+ .long 0x0D42
+ .long 0x0D3B
+ .long 0x0D34
+ .long 0x0D2E
+ .long 0x0D27
+ .long 0x0D20
+ .long 0x0D1A
+ .long 0x0D13
+ .long 0x0D0C
+ .long 0x0D06
+ .long 0x0CFF
+ .long 0x0CF9
+ .long 0x0CF2
+ .long 0x0CEC
+ .long 0x0CE5
+ .long 0x0CDF
+ .long 0x0CD8
+ .long 0x0CD2
+ .long 0x0CCB
+ .long 0x0CC5
+ .long 0x0CBE
+ .long 0x0CB8
+ .long 0x0CB1
+ .long 0x0CAB
+ .long 0x0CA4
+ .long 0x0C9E
+ .long 0x0C98
+ .long 0x0C91
+ .long 0x0C8B
+ .long 0x0C85
+ .long 0x0C7E
+ .long 0x0C78
+ .long 0x0C72
+ .long 0x0C6B
+ .long 0x0C65
+ .long 0x0C5F
+ .long 0x0C58
+ .long 0x0C52
+ .long 0x0C4C
+ .long 0x0C46
+ .long 0x0C3F
+ .long 0x0C39
+ .long 0x0C33
+ .long 0x0C2D
+ .long 0x0C26
+ .long 0x0C20
+ .long 0x0C1A
+ .long 0x0C14
+ .long 0x0C0E
+ .long 0x0C08
+ .long 0x0C02
+ .long 0x0BFB
+ .long 0x0BF5
+ .long 0x0BEF
+ .long 0x0BE9
+ .long 0x0BE3
+ .long 0x0BDD
+ .long 0x0BD7
+ .long 0x0BD1
+ .long 0x0BCB
+ .long 0x0BC5
+ .long 0x0BBF
+ .long 0x0BB9
+ .long 0x0BB3
+ .long 0x0BAD
+ .long 0x0BA7
+ .long 0x0BA1
+ .long 0x0B9B
+ .long 0x0B95
+ .long 0x0B8F
+ .long 0x0B89
+ .long 0x0B83
+ .long 0x0B7D
+ .long 0x0B77
+ .long 0x0B71
+ .long 0x0B6C
+ .long 0x0B66
+ .long 0x0B60
+ .long 0x0B5A
+ .long 0x0B54
+ .long 0x0B4E
+ .long 0x0B48
+ .long 0x0B43
+ .long 0x0B3D
+ .long 0x0B37
+ .long 0x0B31
+ .long 0x0B2B
+ .long 0x0B26
+ .long 0x0B20
+ .long 0x0B1A
+ .long 0x0B14
+ .long 0x0B0F
+ .long 0x0B09
+ .long 0x0B03
+ .long 0x0AFE
+ .long 0x0AF8
+ .long 0x0AF2
+ .long 0x0AED
+ .long 0x0AE7
+ .long 0x0AE1
+ .long 0x0ADC
+ .long 0x0AD6
+ .long 0x0AD0
+ .long 0x0ACB
+ .long 0x0AC5
+ .long 0x0AC0
+ .long 0x0ABA
+ .long 0x0AB4
+ .long 0x0AAF
+ .long 0x0AA9
+ .long 0x0AA4
+ .long 0x0A9E
+ .long 0x0A99
+ .long 0x0A93
+ .long 0x0A8E
+ .long 0x0A88
+ .long 0x0A83
+ .long 0x0A7D
+ .long 0x0A78
+ .long 0x0A72
+ .long 0x0A6D
+ .long 0x0A67
+ .long 0x0A62
+ .long 0x0A5C
+ .long 0x0A57
+ .long 0x0A52
+ .long 0x0A4C
+ .long 0x0A47
+ .long 0x0A41
+ .long 0x0A3C
+ .long 0x0A37
+ .long 0x0A31
+ .long 0x0A2C
+ .long 0x0A27
+ .long 0x0A21
+ .long 0x0A1C
+ .long 0x0A17
+ .long 0x0A11
+ .long 0x0A0C
+ .long 0x0A07
+ .long 0x0A01
+ .long 0x09FC
+ .long 0x09F7
+ .long 0x09F2
+ .long 0x09EC
+ .long 0x09E7
+ .long 0x09E2
+ .long 0x09DD
+ .long 0x09D7
+ .long 0x09D2
+ .long 0x09CD
+ .long 0x09C8
+ .long 0x09C3
+ .long 0x09BD
+ .long 0x09B8
+ .long 0x09B3
+ .long 0x09AE
+ .long 0x09A9
+ .long 0x09A4
+ .long 0x099E
+ .long 0x0999
+ .long 0x0994
+ .long 0x098F
+ .long 0x098A
+ .long 0x0985
+ .long 0x0980
+ .long 0x097B
+ .long 0x0976
+ .long 0x0971
+ .long 0x096C
+ .long 0x0967
+ .long 0x0962
+ .long 0x095C
+ .long 0x0957
+ .long 0x0952
+ .long 0x094D
+ .long 0x0948
+ .long 0x0943
+ .long 0x093E
+ .long 0x0939
+ .long 0x0935
+ .long 0x0930
+ .long 0x092B
+ .long 0x0926
+ .long 0x0921
+ .long 0x091C
+ .long 0x0917
+ .long 0x0912
+ .long 0x090D
+ .long 0x0908
+ .long 0x0903
+ .long 0x08FE
+ .long 0x08FA
+ .long 0x08F5
+ .long 0x08F0
+ .long 0x08EB
+ .long 0x08E6
+ .long 0x08E1
+ .long 0x08DC
+ .long 0x08D8
+ .long 0x08D3
+ .long 0x08CE
+ .long 0x08C9
+ .long 0x08C4
+ .long 0x08C0
+ .long 0x08BB
+ .long 0x08B6
+ .long 0x08B1
+ .long 0x08AC
+ .long 0x08A8
+ .long 0x08A3
+ .long 0x089E
+ .long 0x089A
+ .long 0x0895
+ .long 0x0890
+ .long 0x088B
+ .long 0x0887
+ .long 0x0882
+ .long 0x087D
+ .long 0x0879
+ .long 0x0874
+ .long 0x086F
+ .long 0x086B
+ .long 0x0866
+ .long 0x0861
+ .long 0x085D
+ .long 0x0858
+ .long 0x0853
+ .long 0x084F
+ .long 0x084A
+ .long 0x0846
+ .long 0x0841
+ .long 0x083C
+ .long 0x0838
+ .long 0x0833
+ .long 0x082F
+ .long 0x082A
+ .long 0x0825
+ .long 0x0821
+ .long 0x081C
+ .long 0x0818
+ .long 0x0813
+ .long 0x080F
+ .long 0x080A
+ .long 0x0806
+ .long 0x0801
+ .long 0x07FD
+ .long 0x07F8
+ .long 0x07F4
+ .long 0x07EF
+ .long 0x07EB
+ .long 0x07E6
+ .long 0x07E2
+ .long 0x07DD
+ .long 0x07D9
+ .long 0x07D5
+ .long 0x07D0
+ .long 0x07CC
+ .long 0x07C7
+ .long 0x07C3
+ .long 0x07BE
+ .long 0x07BA
+ .long 0x07B6
+ .long 0x07B1
+ .long 0x07AD
+ .long 0x07A9
+ .long 0x07A4
+ .long 0x07A0
+ .long 0x079B
+ .long 0x0797
+ .long 0x0793
+ .long 0x078E
+ .long 0x078A
+ .long 0x0786
+ .long 0x0781
+ .long 0x077D
+ .long 0x0779
+ .long 0x0774
+ .long 0x0770
+ .long 0x076C
+ .long 0x0768
+ .long 0x0763
+ .long 0x075F
+ .long 0x075B
+ .long 0x0757
+ .long 0x0752
+ .long 0x074E
+ .long 0x074A
+ .long 0x0746
+ .long 0x0741
+ .long 0x073D
+ .long 0x0739
+ .long 0x0735
+ .long 0x0730
+ .long 0x072C
+ .long 0x0728
+ .long 0x0724
+ .long 0x0720
+ .long 0x071C
+ .long 0x0717
+ .long 0x0713
+ .long 0x070F
+ .long 0x070B
+ .long 0x0707
+ .long 0x0703
+ .long 0x06FE
+ .long 0x06FA
+ .long 0x06F6
+ .long 0x06F2
+ .long 0x06EE
+ .long 0x06EA
+ .long 0x06E6
+ .long 0x06E2
+ .long 0x06DE
+ .long 0x06DA
+ .long 0x06D5
+ .long 0x06D1
+ .long 0x06CD
+ .long 0x06C9
+ .long 0x06C5
+ .long 0x06C1
+ .long 0x06BD
+ .long 0x06B9
+ .long 0x06B5
+ .long 0x06B1
+ .long 0x06AD
+ .long 0x06A9
+ .long 0x06A5
+ .long 0x06A1
+ .long 0x069D
+ .long 0x0699
+ .long 0x0695
+ .long 0x0691
+ .long 0x068D
+ .long 0x0689
+ .long 0x0685
+ .long 0x0681
+ .long 0x067D
+ .long 0x0679
+ .long 0x0675
+ .long 0x0671
+ .long 0x066D
+ .long 0x066A
+ .long 0x0666
+ .long 0x0662
+ .long 0x065E
+ .long 0x065A
+ .long 0x0656
+ .long 0x0652
+ .long 0x064E
+ .long 0x064A
+ .long 0x0646
+ .long 0x0643
+ .long 0x063F
+ .long 0x063B
+ .long 0x0637
+ .long 0x0633
+ .long 0x062F
+ .long 0x062B
+ .long 0x0628
+ .long 0x0624
+ .long 0x0620
+ .long 0x061C
+ .long 0x0618
+ .long 0x0614
+ .long 0x0611
+ .long 0x060D
+ .long 0x0609
+ .long 0x0605
+ .long 0x0601
+ .long 0x05FE
+ .long 0x05FA
+ .long 0x05F6
+ .long 0x05F2
+ .long 0x05EF
+ .long 0x05EB
+ .long 0x05E7
+ .long 0x05E3
+ .long 0x05E0
+ .long 0x05DC
+ .long 0x05D8
+ .long 0x05D4
+ .long 0x05D1
+ .long 0x05CD
+ .long 0x05C9
+ .long 0x05C6
+ .long 0x05C2
+ .long 0x05BE
+ .long 0x05BA
+ .long 0x05B7
+ .long 0x05B3
+ .long 0x05AF
+ .long 0x05AC
+ .long 0x05A8
+ .long 0x05A4
+ .long 0x05A1
+ .long 0x059D
+ .long 0x0599
+ .long 0x0596
+ .long 0x0592
+ .long 0x058F
+ .long 0x058B
+ .long 0x0587
+ .long 0x0584
+ .long 0x0580
+ .long 0x057C
+ .long 0x0579
+ .long 0x0575
+ .long 0x0572
+ .long 0x056E
+ .long 0x056B
+ .long 0x0567
+ .long 0x0563
+ .long 0x0560
+ .long 0x055C
+ .long 0x0559
+ .long 0x0555
+ .long 0x0552
+ .long 0x054E
+ .long 0x054A
+ .long 0x0547
+ .long 0x0543
+ .long 0x0540
+ .long 0x053C
+ .long 0x0539
+ .long 0x0535
+ .long 0x0532
+ .long 0x052E
+ .long 0x052B
+ .long 0x0527
+ .long 0x0524
+ .long 0x0520
+ .long 0x051D
+ .long 0x0519
+ .long 0x0516
+ .long 0x0512
+ .long 0x050F
+ .long 0x050B
+ .long 0x0508
+ .long 0x0505
+ .long 0x0501
+ .long 0x04FE
+ .long 0x04FA
+ .long 0x04F7
+ .long 0x04F3
+ .long 0x04F0
+ .long 0x04EC
+ .long 0x04E9
+ .long 0x04E6
+ .long 0x04E2
+ .long 0x04DF
+ .long 0x04DB
+ .long 0x04D8
+ .long 0x04D5
+ .long 0x04D1
+ .long 0x04CE
+ .long 0x04CA
+ .long 0x04C7
+ .long 0x04C4
+ .long 0x04C0
+ .long 0x04BD
+ .long 0x04BA
+ .long 0x04B6
+ .long 0x04B3
+ .long 0x04B0
+ .long 0x04AC
+ .long 0x04A9
+ .long 0x04A6
+ .long 0x04A2
+ .long 0x049F
+ .long 0x049C
+ .long 0x0498
+ .long 0x0495
+ .long 0x0492
+ .long 0x048E
+ .long 0x048B
+ .long 0x0488
+ .long 0x0484
+ .long 0x0481
+ .long 0x047E
+ .long 0x047B
+ .long 0x0477
+ .long 0x0474
+ .long 0x0471
+ .long 0x046E
+ .long 0x046A
+ .long 0x0467
+ .long 0x0464
+ .long 0x0461
+ .long 0x045D
+ .long 0x045A
+ .long 0x0457
+ .long 0x0454
+ .long 0x0450
+ .long 0x044D
+ .long 0x044A
+ .long 0x0447
+ .long 0x0444
+ .long 0x0440
+ .long 0x043D
+ .long 0x043A
+ .long 0x0437
+ .long 0x0434
+ .long 0x0430
+ .long 0x042D
+ .long 0x042A
+ .long 0x0427
+ .long 0x0424
+ .long 0x0420
+ .long 0x041D
+ .long 0x041A
+ .long 0x0417
+ .long 0x0414
+ .long 0x0411
+ .long 0x040E
+ .long 0x040A
+ .long 0x0407
+ .long 0x0404
+ .long 0x0401
+ .long 0x03FE
+ .long 0x03FB
+ .long 0x03F8
+ .long 0x03F5
+ .long 0x03F1
+ .long 0x03EE
+ .long 0x03EB
+ .long 0x03E8
+ .long 0x03E5
+ .long 0x03E2
+ .long 0x03DF
+ .long 0x03DC
+ .long 0x03D9
+ .long 0x03D6
+ .long 0x03D3
+ .long 0x03CF
+ .long 0x03CC
+ .long 0x03C9
+ .long 0x03C6
+ .long 0x03C3
+ .long 0x03C0
+ .long 0x03BD
+ .long 0x03BA
+ .long 0x03B7
+ .long 0x03B4
+ .long 0x03B1
+ .long 0x03AE
+ .long 0x03AB
+ .long 0x03A8
+ .long 0x03A5
+ .long 0x03A2
+ .long 0x039F
+ .long 0x039C
+ .long 0x0399
+ .long 0x0396
+ .long 0x0393
+ .long 0x0390
+ .long 0x038D
+ .long 0x038A
+ .long 0x0387
+ .long 0x0384
+ .long 0x0381
+ .long 0x037E
+ .long 0x037B
+ .long 0x0378
+ .long 0x0375
+ .long 0x0372
+ .long 0x036F
+ .long 0x036C
+ .long 0x0369
+ .long 0x0366
+ .long 0x0363
+ .long 0x0360
+ .long 0x035E
+ .long 0x035B
+ .long 0x0358
+ .long 0x0355
+ .long 0x0352
+ .long 0x034F
+ .long 0x034C
+ .long 0x0349
+ .long 0x0346
+ .long 0x0343
+ .long 0x0340
+ .long 0x033E
+ .long 0x033B
+ .long 0x0338
+ .long 0x0335
+ .long 0x0332
+ .long 0x032F
+ .long 0x032C
+ .long 0x0329
+ .long 0x0327
+ .long 0x0324
+ .long 0x0321
+ .long 0x031E
+ .long 0x031B
+ .long 0x0318
+ .long 0x0315
+ .long 0x0313
+ .long 0x0310
+ .long 0x030D
+ .long 0x030A
+ .long 0x0307
+ .long 0x0304
+ .long 0x0302
+ .long 0x02FF
+ .long 0x02FC
+ .long 0x02F9
+ .long 0x02F6
+ .long 0x02F3
+ .long 0x02F1
+ .long 0x02EE
+ .long 0x02EB
+ .long 0x02E8
+ .long 0x02E5
+ .long 0x02E3
+ .long 0x02E0
+ .long 0x02DD
+ .long 0x02DA
+ .long 0x02D8
+ .long 0x02D5
+ .long 0x02D2
+ .long 0x02CF
+ .long 0x02CC
+ .long 0x02CA
+ .long 0x02C7
+ .long 0x02C4
+ .long 0x02C1
+ .long 0x02BF
+ .long 0x02BC
+ .long 0x02B9
+ .long 0x02B7
+ .long 0x02B4
+ .long 0x02B1
+ .long 0x02AE
+ .long 0x02AC
+ .long 0x02A9
+ .long 0x02A6
+ .long 0x02A3
+ .long 0x02A1
+ .long 0x029E
+ .long 0x029B
+ .long 0x0299
+ .long 0x0296
+ .long 0x0293
+ .long 0x0291
+ .long 0x028E
+ .long 0x028B
+ .long 0x0288
+ .long 0x0286
+ .long 0x0283
+ .long 0x0280
+ .long 0x027E
+ .long 0x027B
+ .long 0x0278
+ .long 0x0276
+ .long 0x0273
+ .long 0x0270
+ .long 0x026E
+ .long 0x026B
+ .long 0x0268
+ .long 0x0266
+ .long 0x0263
+ .long 0x0261
+ .long 0x025E
+ .long 0x025B
+ .long 0x0259
+ .long 0x0256
+ .long 0x0253
+ .long 0x0251
+ .long 0x024E
+ .long 0x024C
+ .long 0x0249
+ .long 0x0246
+ .long 0x0244
+ .long 0x0241
+ .long 0x023E
+ .long 0x023C
+ .long 0x0239
+ .long 0x0237
+ .long 0x0234
+ .long 0x0232
+ .long 0x022F
+ .long 0x022C
+ .long 0x022A
+ .long 0x0227
+ .long 0x0225
+ .long 0x0222
+ .long 0x021F
+ .long 0x021D
+ .long 0x021A
+ .long 0x0218
+ .long 0x0215
+ .long 0x0213
+ .long 0x0210
+ .long 0x020E
+ .long 0x020B
+ .long 0x0208
+ .long 0x0206
+ .long 0x0203
+ .long 0x0201
+ .long 0x01FE
+ .long 0x01FC
+ .long 0x01F9
+ .long 0x01F7
+ .long 0x01F4
+ .long 0x01F2
+ .long 0x01EF
+ .long 0x01ED
+ .long 0x01EA
+ .long 0x01E8
+ .long 0x01E5
+ .long 0x01E3
+ .long 0x01E0
+ .long 0x01DE
+ .long 0x01DB
+ .long 0x01D9
+ .long 0x01D6
+ .long 0x01D4
+ .long 0x01D1
+ .long 0x01CF
+ .long 0x01CC
+ .long 0x01CA
+ .long 0x01C7
+ .long 0x01C5
+ .long 0x01C2
+ .long 0x01C0
+ .long 0x01BD
+ .long 0x01BB
+ .long 0x01B9
+ .long 0x01B6
+ .long 0x01B4
+ .long 0x01B1
+ .long 0x01AF
+ .long 0x01AC
+ .long 0x01AA
+ .long 0x01A7
+ .long 0x01A5
+ .long 0x01A3
+ .long 0x01A0
+ .long 0x019E
+ .long 0x019B
+ .long 0x0199
+ .long 0x0196
+ .long 0x0194
+ .long 0x0192
+ .long 0x018F
+ .long 0x018D
+ .long 0x018A
+ .long 0x0188
+ .long 0x0186
+ .long 0x0183
+ .long 0x0181
+ .long 0x017E
+ .long 0x017C
+ .long 0x017A
+ .long 0x0177
+ .long 0x0175
+ .long 0x0173
+ .long 0x0170
+ .long 0x016E
+ .long 0x016B
+ .long 0x0169
+ .long 0x0167
+ .long 0x0164
+ .long 0x0162
+ .long 0x0160
+ .long 0x015D
+ .long 0x015B
+ .long 0x0159
+ .long 0x0156
+ .long 0x0154
+ .long 0x0151
+ .long 0x014F
+ .long 0x014D
+ .long 0x014A
+ .long 0x0148
+ .long 0x0146
+ .long 0x0143
+ .long 0x0141
+ .long 0x013F
+ .long 0x013C
+ .long 0x013A
+ .long 0x0138
+ .long 0x0136
+ .long 0x0133
+ .long 0x0131
+ .long 0x012F
+ .long 0x012C
+ .long 0x012A
+ .long 0x0128
+ .long 0x0125
+ .long 0x0123
+ .long 0x0121
+ .long 0x011F
+ .long 0x011C
+ .long 0x011A
+ .long 0x0118
+ .long 0x0115
+ .long 0x0113
+ .long 0x0111
+ .long 0x010F
+ .long 0x010C
+ .long 0x010A
+ .long 0x0108
+ .long 0x0105
+ .long 0x0103
+ .long 0x0101
+ .long 0x00FF
+ .long 0x00FC
+ .long 0x00FA
+ .long 0x00F8
+ .long 0x00F6
+ .long 0x00F3
+ .long 0x00F1
+ .long 0x00EF
+ .long 0x00ED
+ .long 0x00EA
+ .long 0x00E8
+ .long 0x00E6
+ .long 0x00E4
+ .long 0x00E2
+ .long 0x00DF
+ .long 0x00DD
+ .long 0x00DB
+ .long 0x00D9
+ .long 0x00D6
+ .long 0x00D4
+ .long 0x00D2
+ .long 0x00D0
+ .long 0x00CE
+ .long 0x00CB
+ .long 0x00C9
+ .long 0x00C7
+ .long 0x00C5
+ .long 0x00C3
+ .long 0x00C0
+ .long 0x00BE
+ .long 0x00BC
+ .long 0x00BA
+ .long 0x00B8
+ .long 0x00B5
+ .long 0x00B3
+ .long 0x00B1
+ .long 0x00AF
+ .long 0x00AD
+ .long 0x00AB
+ .long 0x00A8
+ .long 0x00A6
+ .long 0x00A4
+ .long 0x00A2
+ .long 0x00A0
+ .long 0x009E
+ .long 0x009B
+ .long 0x0099
+ .long 0x0097
+ .long 0x0095
+ .long 0x0093
+ .long 0x0091
+ .long 0x008F
+ .long 0x008C
+ .long 0x008A
+ .long 0x0088
+ .long 0x0086
+ .long 0x0084
+ .long 0x0082
+ .long 0x0080
+ .long 0x007D
+ .long 0x007B
+ .long 0x0079
+ .long 0x0077
+ .long 0x0075
+ .long 0x0073
+ .long 0x0071
+ .long 0x006F
+ .long 0x006D
+ .long 0x006A
+ .long 0x0068
+ .long 0x0066
+ .long 0x0064
+ .long 0x0062
+ .long 0x0060
+ .long 0x005E
+ .long 0x005C
+ .long 0x005A
+ .long 0x0058
+ .long 0x0056
+ .long 0x0053
+ .long 0x0051
+ .long 0x004F
+ .long 0x004D
+ .long 0x004B
+ .long 0x0049
+ .long 0x0047
+ .long 0x0045
+ .long 0x0043
+ .long 0x0041
+ .long 0x003F
+ .long 0x003D
+ .long 0x003B
+ .long 0x0039
+ .long 0x0036
+ .long 0x0034
+ .long 0x0032
+ .long 0x0030
+ .long 0x002E
+ .long 0x002C
+ .long 0x002A
+ .long 0x0028
+ .long 0x0026
+ .long 0x0024
+ .long 0x0022
+ .long 0x0020
+ .long 0x001E
+ .long 0x001C
+ .long 0x001A
+ .long 0x0018
+ .long 0x0016
+ .long 0x0014
+ .long 0x0012
+ .long 0x0010
+ .long 0x000E
+ .long 0x000C
+ .long 0x000A
+ .long 0x0008
+ .long 0x0006
+ .long 0x0004
+ .long 0x0002
+
diff --git a/src/gas/vrd4log.S b/src/gas/vrd4log.S
new file mode 100644
index 0000000..1e2b1e4
--- /dev/null
+++ b/src/gas/vrd4log.S
@@ -0,0 +1,855 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log.S
+#
+# A vector implementation of the log libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log(__m128d x1, __m128d x2);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute 4 logs in
+# 192 cycles, or 48 cycles per value.
+#
+# This routine computes 4 double precision log values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
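+# A rough scalar model of the main path below (names are illustrative;
+# near-one, zero, negative, NaN and infinite inputs take separate paths):
+#   xexp  = biased_exponent(x) - 1023;   /* f = mantissa of x scaled into [0.5,1) */
+#   index = top 7 mantissa bits of f, rounded, in [64,128];
+#   f1    = index/128;   f2 = f - f1;
+#   u     = f2 / (f1 + 0.5*f2);
+#   poly  = u + u^3*(cb_1 + u^2*(cb_2 + u^2*cb_3));
+#   log(x) ~= xexp*log2_lead + ln_lead_table[index-64]
+#           + (poly + ln_tail_table[index-64] + xexp*log2_tail);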
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log
+ .type __vrd4_log,@function
+__vrd4_log:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the logs
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+	# It seems like a good idea to try to interleave
+	# even more of the following code earlier in the
+	# routine, but there were conflicts with the table
+	# index registers that made this difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result _m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
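+# Roughly (a sketch, names illustrative):
+#   if (mantissa(x) != 0)   return quiet(x);   /* NaN -> quiet NaN   */
+#   else if (x == +inf)     return +inf;       /* log(+inf) = +inf   */
+#   else                    return NaN;        /* x == -inf: invalid */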
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		## nonzero after the shift means a negative, non-zero input
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two:		.quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4log10.S b/src/gas/vrd4log10.S
new file mode 100644
index 0000000..d0f861c
--- /dev/null
+++ b/src/gas/vrd4log10.S
@@ -0,0 +1,924 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log10.asm
+#
+# A vector implementation of the log10 libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log10(__m128d x1, __m128d x2);
+#
+#   Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute 4 log10s in
+# 220 cycles, or 55 per value
+#
+# This routine computes 4 double precision log10 values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+#   (and indeed C itself) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
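+# In other words (an illustrative restatement, not part of the original
+# source): on entry xmm0 = {a, b} and xmm1 = {c, d}; on return
+# xmm0 = {log10(a), log10(b)} and xmm1 = {log10(c), log10(d)}, element by
+# element.  Because standard C cannot express the two-register return, the
+# entry point is intended to be used directly by compilers or from
+# assembly rather than through an ordinary C prototype.
+#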
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log10
+ .type __vrd4_log10,@function
+__vrd4_log10:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the log10s
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
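+
+# (Illustrative sketch, not part of the original source: the exponent and
+#  index extraction above, together with the f1/f2/u reduction that
+#  follows, correspond roughly to the scalar C below; every name here is
+#  hypothetical.)
+#
+#     #include <stdint.h>
+#     #include <string.h>
+#     uint64_t ux;  memcpy(&ux, &x, 8);               /* raw bits of x      */
+#     int      xexp = (int)(ux >> 52) - 1023;         /* unbiased exponent  */
+#     uint64_t mant = ux & 0x000FFFFFFFFFFFFFULL;     /* 52 mantissa bits   */
+#     int      top7 = (int)(mant >> 45);              /* leading 7 bits     */
+#     int      index = (top7 >> 1) + (top7 & 1) + 64; /* rounded, biased    */
+#     double   f1 = index * 0.0078125;                /* index/128, [.5,1]  */
+#     uint64_t fb = mant | 0x3FE0000000000000ULL;     /* exponent of 0.5    */
+#     double   f;   memcpy(&f, &fb, 8);               /* f in [0.5, 1)      */
+#     double   f2 = f - f1;
+#     double   u  = f2 / (f1 + 0.5 * f2);             /* reduced argument   */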
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log10 tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
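+# (Sketch of the recombination below, not part of the original source:
+#  ln(x) is assembled as xexp*ln2 + T[index] + poly(u), where T[] is the
+#  tabulated log of the leading fraction, split across the lead/tail
+#  tables; r1 collects the lead contributions and r2 the tail
+#  contributions plus the polynomial.  log10(x) is then (r1 + r2)*log10e,
+#  with log10e split into .L__real_log10e_lead/_tail and the partial
+#  products accumulated smallest first to limit rounding error.)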
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2 #for log10
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+
+
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ movapd %xmm1,%xmm7 #for log10
+ mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10
+ addpd %xmm1,%xmm0 #for log10
+
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ mulpd .L__real_log10e_lead(%rip),%xmm7 #log10
+ andpd .L__real_inf(%rip),%xmm3
+
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ addpd %xmm7,%xmm0 #for log10
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+
+
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0 #for log10
+# addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ movapd %xmm7,%xmm6 #for log10
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+ movapd %xmm9,%xmm8 #for log10
+	mulpd	.L__real_log10e_tail(%rip),%xmm9	#for log10
+ addpd %xmm9,%xmm7 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10
+ addpd %xmm8,%xmm7 #for log10
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm6,%xmm7 #for log10
+# addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result __m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
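+# (Illustrative note, not part of the original source: r is masked with
+#  .L__mask_lower so that r1 keeps only the upper 32 bits of its
+#  representation; the product r1*log10e_lead is then exact, and the
+#  discarded low part (r - r1) is folded into r2 before the lead/tail
+#  products are summed.)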
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
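+# (Illustrative sketch, not part of the original source; in rough scalar
+#  terms, with `bits` the raw input and all names hypothetical:
+#      if (bits & 0x000FFFFFFFFFFFFFULL) return quieted input NaN;
+#      else if ((int64_t)bits >= 0)      return +inf;  /* log10(+inf)   */
+#      else                              return NaN;   /* log10(-inf)   */
+#  )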
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
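+# (Illustrative sketch, not part of the original source; in rough scalar
+#  terms, with `bits` the raw input:
+#      if ((bits << 1) == 0) return -inf;  /* C99: log10(+-0) = -inf    */
+#      else                  return NaN;   /* negative, non-zero input  */
+#  )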
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x				## non-zero after shifting out the sign bit => negative input, return NaN
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two: .quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4log2.S b/src/gas/vrd4log2.S
new file mode 100644
index 0000000..bc254cf
--- /dev/null
+++ b/src/gas/vrd4log2.S
@@ -0,0 +1,908 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log2.asm
+#
+# A vector implementation of the log2 libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log2(__m128d x1, __m128d x2);
+#
+#   Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+#   Less than 1 ulp of error. This version can compute 4 log2s in
+# 192 cycles, or 48 per value
+#
+# This routine computes 4 double precision log values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+#   (and indeed C itself) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
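+# Illustrative use (hypothetical, not part of this file): a vectorizing
+# compiler aware of this entry point could strip-mine a loop such as
+#
+#     for (i = 0; i < n; i++) y[i] = log2(x[i]);
+#
+# into code that loads x[i..i+3] into xmm0/xmm1, calls __vrd4_log2, and
+# stores the two result registers back to y[i..i+3], avoiding the extra
+# trip through memory that the array variant requires.
+#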
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log2
+ .type __vrd4_log2,@function
+__vrd4_log2:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the logs
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+
+ movapd p_xexp(%rsp),%xmm6 # xexp
+ addpd %xmm2,%xmm1 # poly
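+# (Sketch of the recombination below, not part of the original source:
+#  the log of the reduced fraction is z1 + z2, where z1 is the lead-table
+#  entry and z2 is the tail-table entry plus the polynomial; then
+#      log2(x) = xexp + (z1 + z2)*log2e,
+#  with log2e split into .L__real_log2e_lead/_tail: r1 = xexp +
+#  z1*log2e_lead, and r2 collects the remaining tail products, which are
+#  added last.)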
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm2,%xmm1 #z2
+ movapd %xmm1,%xmm2 #z2 copy
+
+
+ mulpd %xmm4,%xmm5
+ mulpd %xmm4,%xmm1
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
+
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0 #r1+r2
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ movapd %xmm7,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp2(%rsp),%xmm6 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9 #z2
+ movapd %xmm9,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm9 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm7 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail
+
+
+ addpd %xmm9,%xmm7 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm5,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result __m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log2e_tail(%rip),%xmm2
+ mulpd .L__real_log2e_tail(%rip),%xmm0
+ mulpd .L__real_log2e_lead(%rip),%xmm1
+ mulpd .L__real_log2e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log2e_tail(%rip),%xmm2
+ mulsd .L__real_log2e_tail(%rip),%xmm0
+ mulsd .L__real_log2e_lead(%rip),%xmm1
+ mulsd .L__real_log2e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x				## non-zero after shifting out the sign bit => negative input, return NaN
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two: .quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail: .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4sin.S b/src/gas/vrd4sin.S
new file mode 100644
index 0000000..b611dfd
--- /dev/null
+++ b/src/gas/vrd4sin.S
@@ -0,0 +1,2915 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4sin.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_sin(__m128d x1, __m128d x2);
+#
+# Computes the sine of x.
+# Unlike the array version it is derived from, results are returned in registers, not stored to a y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 double precision Sine values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C itself) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates that overhead
+# when the data does not already reside in memory.
+# This routine is derived directly from the array version.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
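+# As a point of reference, the semantics (ignoring the special cases noted
+# above) are simply four independent double-precision sine evaluations. A
+# minimal C sketch of what the routine computes; the packed two-register
+# return interface itself cannot be written in portable C:
+#
+#   #include <math.h>
+#   /* reference behaviour of __vrd4_sin on four packed doubles */
+#   static void vrd4_sin_ref(const double x[4], double y[4]) {
+#       for (int i = 0; i < 4; i++)
+#           y[i] = sin(x[i]);
+#   }
+#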
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2^-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.Levensin_oddcos_tbl:
+ .quad .Lsinsin_sinsin_piby4 # 0
+ .quad .Lsinsin_sincos_piby4 # 1
+ .quad .Lsinsin_cossin_piby4 # 2
+ .quad .Lsinsin_coscos_piby4 # 3
+
+ .quad .Lsincos_sinsin_piby4 # 4
+ .quad .Lsincos_sincos_piby4 # 5
+ .quad .Lsincos_cossin_piby4 # 6
+ .quad .Lsincos_coscos_piby4 # 7
+
+ .quad .Lcossin_sinsin_piby4 # 8
+ .quad .Lcossin_sincos_piby4 # 9
+ .quad .Lcossin_cossin_piby4 # 10
+ .quad .Lcossin_coscos_piby4 # 11
+
+ .quad .Lcoscos_sinsin_piby4 # 12
+ .quad .Lcoscos_sincos_piby4 # 13
+ .quad .Lcoscos_cossin_piby4 # 14
+ .quad .Lcoscos_coscos_piby4 # 15
+
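+# This table is indexed later with a 4-bit value: bits 1:0 encode, for the
+# xmm0 pair, whether the lower and upper lane fall in an even (sin kernel)
+# or odd (cos kernel) quadrant, and bits 3:2 do the same for the xmm1 pair.
+# A C sketch of how one pair contributes its two bits (names illustrative):
+#
+#   /* npi2_lo/npi2_hi: quadrant counts for the two lanes of one pair */
+#   static int pair_bits(int npi2_lo, int npi2_hi) {
+#       return (npi2_lo & 1) | ((npi2_hi & 1) << 1);
+#   }
+#   /* index = pair_bits(pair0_lo, pair0_hi) | (pair_bits(pair1_lo, pair1_hi) << 2) */
+#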
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x20 # save area for xmm6
+.equ save_xmm7, 0x30 # save area for xmm7
+.equ save_xmm8, 0x40 # save area for xmm8
+.equ save_xmm9, 0x50 # save area for xmm9
+.equ save_xmm10, 0x60 # save area for xmm10
+.equ save_xmm11, 0x70 # save area for xmm11
+.equ save_xmm12, 0x80 # save area for xmm12
+.equ save_xmm13, 0x90 # save area for xmm13
+.equ save_xmm14, 0x0A0 # save area for xmm14
+.equ save_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # r for remainder_piby2
+.equ rr, 0x0D0 # rr for remainder_piby2
+.equ region, 0x0E0 # region for remainder_piby2
+
+.equ r1, 0x0F0 # r for remainder_piby2 (second pair)
+.equ rr1, 0x0100 # rr for remainder_piby2 (second pair)
+.equ region1, 0x0110 # region for remainder_piby2 (second pair)
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_r12, 0x01C0 # save area for r12
+.equ save_r13, 0x01D0 # save area for r13
+
+.globl __vrd4_sin
+ .type __vrd4_sin,@function
+__vrd4_sin:
+
+ sub $0x1E8,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movdqa %xmm0,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp)
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp)
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
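+# The AND/OR/NOT sequence above is just an XOR: the final sign that gets
+# XOR-ed into the result is sign(x) ^ (bit 1 of npi2), evaluated per lane
+# (r12/r13 carry the input sign bits, r10/r11 carry the regions shifted
+# right by 1). A scalar C sketch of the same decision (signbit is from
+# <math.h>):
+#
+#   #include <math.h>
+#   #include <stdint.h>
+#   /* sign word to XOR into the packed result for one lane */
+#   static uint64_t sin_sign(double x, int npi2) {
+#       int flip = (signbit(x) ? 1 : 0) ^ ((npi2 >> 1) & 1);
+#       return flip ? 0x8000000000000000ULL : 0;      /* sign-bit pattern */
+#   }
+#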
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
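+# The sequence above is the packed form of the usual two-piece (head/tail)
+# reduction by pi/2 for the fast path (all |x| < 5e5 here; larger inputs go
+# through __amd_remainder_piby2 instead). A scalar C sketch of the same
+# arithmetic, where twobypi, piby2_1, piby2_2 and piby2_2tail stand for the
+# .L__real_* constants in the data section above:
+#
+#   /* returns region = npi2; r + rr approximates x - npi2*pi/2 */
+#   static int reduce_piby2(double x, double *r, double *rr) {
+#       int    npi2  = (int)(x * twobypi + 0.5);
+#       double rhead = x - npi2 * piby2_1;            /* head of remainder */
+#       double rtail = npi2 * piby2_2;
+#       double t     = rhead;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#       *r  = rhead - rtail;                          /* main part  */
+#       *rr = (rhead - *r) - rtail;                   /* correction */
+#       return npi2;      /* bits 0-1 pick the sin/cos kernel and the sign */
+#   }
+#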
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12
+# xmm9, xmm11, xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
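+# For a NaN/Inf lane no reduction is attempted: the quiet bit (bit 51) is
+# OR-ed into the raw bit pattern so a signalling NaN (or an infinity, for
+# which sin is undefined) comes back as a quiet NaN, and rr/region are
+# zeroed so the later polynomial path simply propagates that value. A C
+# sketch (bits_of/double_of are illustrative bit-cast helpers, not part of
+# this source):
+#
+#   #include <stdint.h>
+#   /* used when (bits_of(x) & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL */
+#   static double quiet_naninf(double x) {
+#       uint64_t ux = bits_of(x);                       /* raw IEEE-754 bits */
+#       return double_of(ux | 0x0008000000000000ULL);   /* set the quiet bit */
+#   }
+#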
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+# mov p_original1(%rsp),%r8 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+# mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ movapd p_sign(%rsp),%xmm0
+ movapd p_sign1(%rsp),%xmm1
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+.Lfinal_check:
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x1E8,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+
+
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # recalculate t, -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # recalculate t, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
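+# The block above is the packed cos kernel for |r| <= pi/4 applied to both
+# pairs. In scalar form, with c1..c6 the .Lcosarray coefficients above and
+# (x, xx) the reduced argument r + rr, it computes (a sketch, associations
+# as in the code):
+#
+#   static double cos_piby4(double x, double xx) {
+#       double x2 = x * x;
+#       double zc = (c1 + x2*(c2 + x2*c3))
+#                 + x2*x2*x2*(c4 + x2*(c5 + x2*c6));
+#       double r  = 0.5 * x2;
+#       double t  = 1.0 - r;                 /* head of 1 - x*x/2 */
+#       return t + (((1.0 - t) - r) - x*xx + x2*x2*zc);
+#   }
+#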
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
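+# The block above evaluates, per pair, the sin kernel in one lane and the
+# cos kernel in the other. The scalar sin kernel for |r| <= pi/4, with
+# s1..s6 the .Lsinarray coefficients above and (x, xx) the reduced argument
+# r + rr, is (a sketch, associations as in the code):
+#
+#   static double sin_piby4(double x, double xx) {
+#       double x2 = x * x;
+#       double zs = (s1 + x2*(s2 + x2*s3))
+#                 + x2*x2*x2*(s4 + x2*(s5 + x2*s6));
+#       return x + (xx + (x2*x*zs - 0.5*x2*xx));
+#   }
+#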
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+ mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+        subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+        subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+        subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+        subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1
+        subsd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0, %xmm2 # x3
+ mulpd %xmm3, %xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1, %xmm9 # +x
+
+ movlhps %xmm9, %xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+        movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# xorpd %xmm0, %xmm0
+# xorpd %xmm1, %xmm1
+# jmp .Lfinal_check
+#DEBUG
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_sin_cleanup
diff --git a/src/gas/vrda_scaled_logr.S b/src/gas/vrda_scaled_logr.S
new file mode 100644
index 0000000..9d1bdc1
--- /dev/null
+++ b/src/gas/vrda_scaled_logr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrda_scaled_logr.s
+#
+# An array implementation of the log libm function.
+# Adapted to provide a scaling and shifting factor. This routine is
+# used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+# void vrda_scaled_logr(int n, double *x, double *y, double b);
+#
+# Computes the natural log of x multiplied by b.
+# A reduced precision routine. Uses the Intel novel reduction technique
+# with frcpa to compute logs.
+# Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant
+# This version can compute logs in 26
+# cycles with n <= 24
+#
+#
+
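+# Illustrative scalar sketch (not part of the build) of the value computed
+# per array element below; frcpa(), N, k and lnf_table[] are stand-ins for
+# the hardware reciprocal approximation, the exponent, the table index and
+# .L__np_lnf_table:
+#
+#   double scaled_logr_element(double x, double b)
+#   {
+#       double c = frcpa(x);                  /* ~1/x, mantissa on a coarse grid */
+#       double r = x*c - 1.0;                 /* small reduced argument          */
+#       double p = r + r*r*(-0.5 + r/3.0);    /* 3-term series for log(1+r)      */
+#       double T = N*log(2.0) + lnf_table[k]; /* N, k recovered from bits of x   */
+#       return b*(T + p);                     /* y[i] = b * log(x[i])            */
+#   }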
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+
+.equ stack_size,0x0e8
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_scaled_logr__
+ .set vrda_scaled_logr__,__vrda_scaled_logr__
+ .weak vrda_scaled_logr_
+ .set vrda_scaled_logr_,__vrda_scaled_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#x/* a FORTRAN subroutine implementation of array log
+#** VRDA_SCALED_LOGR(N,X,Y,B)
+# C equivalent*/
+#void vrda_scaled_logr__(int * n, double *x, double *y,double *b)
+#{
+# vrda_scaled_logr(*n,x,y,b);
+#}
+.globl __vrda_scaled_logr__
+ .type __vrda_scaled_logr__,@function
+__vrda_scaled_logr__:
+ mov (%rdi),%edi
+ movlpd (%rcx),%xmm0
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_scaled_logr
+ .type vrda_scaled_logr,@function
+vrda_scaled_logr:
+ sub $stack_size,%rsp
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+# move the scale and shift factor to another register
+ movsd %xmm0,%xmm10
+ unpcklpd %xmm10,%xmm10
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 4 values (two __m128d) at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+
+# compute the logs
+
+# movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
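+# (xmm8/xmm9 keep copies of the original x bits; they are used below to
+#  extract the exponent N and the mantissa-derived index into .L__np_lnf_table)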
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# invert the exponent
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+# movdqa %xmm8,%xmm0
+# movdqa %xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial
+# p(r) = p2*r^2 + p3*r^3, with p2 = -1/2 and p3 = 1/3
+# (evaluated below as r^2*(p2 + p3*r); the leading r term is added separately)
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2)+ln(1/frcpa(x)) via table of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255
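+# (at this point xmm0/xmm1 hold N, xmm6/xmm9 the table values, and
+#  xmm2/xmm8 the series r + p(r); the scale factor b is applied at the store)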
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ mulpd %xmm10,%xmm0
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ mulpd %xmm10,%xmm1
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+
+
+
+# we jump here when we have an odd number of log calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ movsd %xmm10,%xmm0
+ call vrda_scaled_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # ln(2) = 0.6931471805599453
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrda_scaledshifted_logr.S b/src/gas/vrda_scaledshifted_logr.S
new file mode 100644
index 0000000..960460d
--- /dev/null
+++ b/src/gas/vrda_scaledshifted_logr.S
@@ -0,0 +1,2451 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrda_scaledshifted_logr.S
+#
+# An array implementation of the log libm function.
+#  Adapted to provide a scaling and shifting factor. This routine is
+#  used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+#    void vrda_scaledshifted_logr(int n, double *x, double *y, double b, double a);
+#
+#   Computes the natural log of each element of x, multiplied by b, plus a.
+#   A reduced-precision routine. Uses the Intel novel reduction technique
+#   with frcpa to compute logs.
+#   Uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+#   This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+#   This routine is not C99 compliant.
+#   This version can compute logs in 26 cycles with n <= 24.
+#
+#
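+# For reference, a minimal scalar C sketch of the per-element operation
+# (illustrative only -- not part of this file, and it uses the libm log()
+# rather than the frcpa-based reduction implemented below; the helper name
+# is hypothetical):
+#
+#   #include <math.h>
+#
+#   void scaledshifted_log_ref(int n, const double *x, double *y,
+#                              double b, double a)
+#   {
+#       for (int i = 0; i < n; i++)
+#           y[i] = b * log(x[i]) + a;  /* natural log, scaled by b, shifted by a */
+#   }
+#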
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # exponent storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # exponent storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+
+
+
+.equ stack_size,0x0e8
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_scaledshifted_logr__
+ .set vrda_scaledshifted_logr__,__vrda_scaledshifted_logr__
+ .weak vrda_scaledshifted_logr_
+ .set vrda_scaledshifted_logr_,__vrda_scaledshifted_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int *n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+# r8 - double *a
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#**   VRDA_SCALEDSHIFTED_LOGR(N,X,Y,B,A)
+# C equivalent*/
+#void vrda_scaledshifted_logr__(int *n, double *x, double *y, double *b, double *a)
+#{
+#       vrda_scaledshifted_logr(*n, x, y, *b, *a);
+#}
+.globl __vrda_scaledshifted_logr__
+ .type __vrda_scaledshifted_logr__,@function
+__vrda_scaledshifted_logr__:
+ mov (%rdi),%edi
+ movlpd (%rcx),%xmm0
+ movlpd (%r8),%xmm1
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+# xmm1 - double a
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_scaledshifted_logr
+ .type vrda_scaledshifted_logr,@function
+vrda_scaledshifted_logr:
+ sub $stack_size,%rsp
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+# move the scale and shift factor to another register
+ movsd %xmm0,%xmm10
+ unpcklpd %xmm10,%xmm10
+ movsd %xmm1,%xmm11
+ unpcklpd %xmm11,%xmm11
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+
+# compute the logs
+
+# movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
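+#  (with this r, ln(x) = ln(x*frcpa(x)) + ln(1/frcpa(x))
+#                      = ln(1+r) + N*ln(2) + table term,
+#   so only ln(1+r), with r small, needs a polynomial; see the
+#   reconstruction step below)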
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# shift out the sign bit so the biased exponent can be extracted below
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+# movdqa %xmm8,%xmm0
+# movdqa %xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
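+#  (roughly: the rounded top 10 mantissa bits give k in [0,1023]; the final
+#   shift left by 1 doubles k because each .L__np_lnf_table entry is stored
+#   as a pair of identical quadwords, i.e. successive entries are 16 bytes apart)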
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial (truncated to two terms beyond r in this
+# reduced-precision version)
+# p(r) = p2*r^2 + p3*r^3, with p2 = -1/2 and p3 = 1/3
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2) + ln(1/frcpa(x)), the latter read from a table of
+# ln(1/frcpa(y)) values for y = 1 + k/1024, 0 <= k <= 1023
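+#
+# the stores below then apply the scale and shift, y = b*ln(x) + a, with
+# b broadcast in xmm10 and a broadcast in xmm11 (the mulpd/addpd pairs
+# before each movlpd/movhpd store)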
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ mulpd %xmm10,%xmm0
+ addpd %xmm11,%xmm0
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ mulpd %xmm10,%xmm1
+ addpd %xmm11,%xmm1
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+
+
+
+# we jump here when there are leftover (fewer than four) log calls to make
+# at the end
+#  the next x array element is found via save_xa and the next y array
+#  element via save_ya.  The number of values left is in
+#  save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an _m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ movsd %xmm10,%xmm0
+ movsd %xmm11,%xmm1
+ call vrda_scaledshifted_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+.L__real_fifth: .quad 0x03fc999999999999a # 1/5
+ .quad 0x03fc999999999999a
+.L__real_sixth: .quad 0x0bfc5555555555555 # -1/6
+ .quad 0x0bfc5555555555555
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # 0.693147182465
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrdacos.S b/src/gas/vrdacos.S
new file mode 100644
index 0000000..5e2b3a4
--- /dev/null
+++ b/src/gas/vrdacos.S
@@ -0,0 +1,3118 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdacos.S
+#
+# An array implementation of the cos libm function.
+#
+# Prototype:
+#
+# void vrda_cos(int n, double *x, double *y);
+#
+#Computes Cosine of x for an array of input values.
+#Places the results into the supplied y array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
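+#
+# Illustrative usage sketch (added commentary, not part of the upstream
+# sources): calling vrda_cos from C with the prototype shown above. The
+# array contents and length below are arbitrary placeholders.
+#
+# /*
+# #include <stdio.h>
+#
+# extern void vrda_cos(int n, double *x, double *y);
+#
+# int main(void)
+# {
+#     double x[4] = {0.0, 0.5, 1.0, 2.0};  /* input values          */
+#     double y[4];                         /* results written here  */
+#     vrda_cos(4, x, y);                   /* y[i] = cos(x[i])      */
+#     for (int i = 0; i < 4; i++)
+#         printf("cos(%g) = %.17g\n", x[i], y[i]);
+#     return 0;
+# }
+# */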
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
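+
+# Added note (not in the upstream sources): the coefficients above correspond
+# to the series used over the reduced argument r (|r| <= pi/4):
+#   cos(r) ~= 1 - r^2/2 + c1*r^4 + c2*r^6 + c3*r^8 + c4*r^10 + c5*r^12 + c6*r^14
+#   sin(r) ~=     r     + s1*r^3 + s2*r^5 + s3*r^7 + s4*r^9  + s5*r^11 + s6*r^13
+# e.g. c1 ~ 1/4! = 0.0416667, c2 ~ -1/6! = -0.00138889, s1 ~ -1/3! = -0.166667.
+# The leading 1 - r^2/2 and r terms appear to be applied separately in the
+# evaluation code (note the 0.5 constant defined above).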
+
+.align 16
+.Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 *
+ .quad .Lcoscos_cossin_piby4 # 1 +
+ .quad .Lcoscos_sincos_piby4 # 2
+ .quad .Lcoscos_sinsin_piby4 # 3 +
+
+ .quad .Lcossin_coscos_piby4 # 4
+ .quad .Lcossin_cossin_piby4 # 5 *
+ .quad .Lcossin_sincos_piby4 # 6
+ .quad .Lcossin_sinsin_piby4 # 7
+
+ .quad .Lsincos_coscos_piby4 # 8
+ .quad .Lsincos_cossin_piby4 # 9
+ .quad .Lsincos_sincos_piby4 # 10 *
+ .quad .Lsincos_sinsin_piby4 # 11
+
+ .quad .Lsinsin_coscos_piby4 # 12
+ .quad .Lsinsin_cossin_piby4 # 13 +
+ .quad .Lsinsin_sincos_piby4 # 14
+ .quad .Lsinsin_sinsin_piby4 # 15 *
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_cos_
+ .set vrda_cos_,__vrda_cos__
+ .weak vrda_cos__
+ .set vrda_cos__,__vrda_cos__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array cos
+#** VRDA_COS(N,X,Y)
+# C equivalent*/
+#void vrda_cos__(int * n, double *x, double *y)
+#{
+# vrda_cos(*n,x,y);
+#}
+.globl __vrda_cos__
+ .type __vrda_cos__,@function
+__vrda_cos__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ p_xmm6, 0x20 # save area for xmm6
+.equ p_xmm7, 0x30 # save area for xmm7
+.equ p_xmm8, 0x40 # save area for xmm8
+.equ p_xmm9, 0x50 # save area for xmm9
+.equ p_xmm10, 0x60 # save area for xmm10
+.equ p_xmm11, 0x70 # save area for xmm11
+.equ p_xmm12, 0x80 # save area for xmm12
+.equ p_xmm13, 0x90 # save area for xmm13
+.equ p_xmm14, 0x0A0 # save area for xmm14
+.equ p_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # storage for r, passed by address to __amd_remainder_piby2
+.equ rr, 0x0D0 # storage for rr for __amd_remainder_piby2
+.equ region, 0x0E0 # storage for region for __amd_remainder_piby2
+
+.equ r1, 0x0F0 # storage for r of the second pair
+.equ rr1, 0x0100 # storage for rr of the second pair
+.equ region1, 0x0110 # storage for region of the second pair
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_xa, 0x01C0 #qword
+.equ save_ya, 0x01D0 #qword
+
+.equ save_nv, 0x01E0 #qword
+.equ p_iter, 0x01F0 #qword storage for number of loop iterations
+
+
+.globl vrda_cos
+ .type vrda_cos,@function
+vrda_cos:
+# parameters are passed in by Linux C as:
+# edi - int n
+# rsi - double *x
+# rdx - double *y
+
+
+ sub $0x208,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+	jz	.L__vrda_cleanup		# jump if fewer than four values; handle them singly in cleanup
+
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+# build the input __m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movdqa %xmm0,p_original(%rsp)
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ movdqa %xmm1,p_original1(%rsp)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8				#r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp)		#
+mov p_temp1+8(%rsp),%r9			#r9 = upper arg
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
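+
+# Added note: 5e5 (0x411E848000000000) appears to be the cutoff beyond which
+# the inline two-constant reduction below would lose too much precision; any
+# lane whose |x| >= 5e5 is instead reduced on the slower scalar paths via the
+# __amd_remainder_piby2 helper.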
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%rax # Region
+ movd %xmm5,%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
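+
+# Added sketch (C-style pseudocode, not part of the upstream sources) of the
+# extra-precision reduction implemented by the surrounding instructions; the
+# names mirror the .L__real_* constants, and r/rr are formed a few lines below:
+#
+# /*
+# npi2  = (double)(int)(x * twobypi + 0.5);          // nearest multiple of pi/2
+# rhead = x - npi2 * piby2_1;                        // leading remainder
+# rtail = npi2 * piby2_2;
+# t     = rhead;
+# rhead = t - rtail;
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+# r     = rhead - rtail;                             // reduced argument
+# rr    = (rhead - r) - rtail;                       // low-order correction
+# */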
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+# paddd .L__reald_one_one(%rip),%xmm4 ; Sign
+# paddd .L__reald_one_one(%rip),%xmm5 ; Sign
+# pand .L__reald_two_two(%rip),%xmm4
+# pand .L__reald_two_two(%rip),%xmm5
+# punpckldq %xmm4,%xmm4
+# punpckldq %xmm5,%xmm5
+# psllq $62,%xmm4
+# psllq $62,%xmm5
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+# GET_BITS_DP64(rhead-rtail, uy); originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
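+
+# Added note: the 4-bit index built above packs one bit per input lane
+# (bits 0-1 from the first pair, bits 2-3 from the second pair); each bit is
+# the low bit of that lane's region modulo 4, i.e. whether the lane reduces to
+# a sin or a cos polynomial, selecting one of the 16 .Levencos_oddsin_tbl
+# entries.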
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm0,%xmm2,%xmm6 = x, %xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm5,%xmm8,%xmm10,%xmm12
+# %xmm9,%xmm11,%xmm13
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portions of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10				# xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
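+
+# The scalar sequence above is the usual Cody-Waite reduction with pi/2 split
+# into three parts (piby2_1, piby2_2, piby2_2tail, loaded above); in C it is
+# roughly the following sketch:
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;              /* piby2_1 carries only the leading bits of pi/2 */
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;                   /* reduced argument  */
+#   rr    = (rhead - r) - rtail;             /* correction term   */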
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
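+
+# __amd_remainder_piby2 is the slow out-of-line reduction used once an element
+# is too large for the three-part split; as called here it takes the argument
+# in %xmm0 and pointers to the r, rr and region slots in %rdi, %rsi and %rdx,
+# so registers still holding live values are parked in the p_temp* slots
+# across the call.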
+
+.L__vrd4_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
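+
+# NaN/Inf inputs skip the reduction entirely: OR-ing in 0x0008000000000000
+# (the quiet bit of the mantissa) turns the value into a quiet NaN, which then
+# propagates through the polynomial kernel as the final answer, while rr and
+# the region are zeroed so the rest of the path stays well defined.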
+
+.align 16
+0:
+
+
+#DEBUG
+#	movapd	r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm0,%xmm0	;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	movlpd	%xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
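+
+# The first pair is now reduced; the same 5e5 threshold is applied to the
+# third and fourth arguments to choose between the packed fast reduction
+# below and the scalar / out-of-line handling at .Lfirst_second_done_*.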
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_cos_reconstruct
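+
+# This packed block is the same Cody-Waite reduction as the scalar sketch
+# earlier, applied to both elements of the second pair at once
+# (cvttpd2dq/cvtdq2pd produce the two npi2 values in one step); r1, rr1 and
+# region1 are written straight to their stack slots for the reconstruct phase.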
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_higher:
+	mov	p_original1(%rsp),%r8			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd rr(%rsp),%xmm4
+# movapd rr1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movd %r8,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+#	mov	%r9,r1+8(%rsp)
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm1,%xmm1	;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+	subsd	%xmm0,%xmm1					# xmm1 = r = (rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+	subsd	%xmm0,%xmm7					# xmm7 = rr = ((rhead-r) - rtail)
+
+	movlpd	%xmm1,r1(%rsp)				# store lower r
+	movlpd	%xmm7,rr1(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 # upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# movapd region(%rsp),%xmm4
+# movapd region1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%rax
+ mov region1(%rsp),%rcx
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# movd %rax,%xmm4
+# movd %rax,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_cleanup:
+
+ movapd p_sign(%rsp), %xmm0
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+.L__vrda_bottom2:
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1, -16(%rdi)
+ movhpd %xmm1, -8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+ add $0x208,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have an odd number of cos calls to make at the end
+# we assume that rdx is pointing at the next x array element, r8 at the next y array element.
+# The number of values left is in save_nv
+
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrda_cos@PLT
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_temp2(%rsp),%rcx
+ mov %rcx, (%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx, 8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx, 16(%rdi) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
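+
+# Tail handling: the 1-3 leftover inputs are copied into a zero-padded scratch
+# buffer, vrda_cos is re-entered on that full group of four, and only the
+# valid results are copied back out, so the main loop never sees a partial
+# vector.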
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
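+# Both elements of both pairs need a cos here.  Each lane is evaluated as
+#
+#   cos(x) ~= t + ( x^4*zc + ((1 - t) - r - x*xx) ),   r = 0.5*x^2,  t = 1 - r,
+#
+# where zc is the .Lcosarray polynomial in x^2 accumulated below; carrying t
+# and the x*xx term separately keeps the cancellation against 1 - 0.5*x^2 and
+# the rr correction from being lost.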
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# t recalculate, -t = r-1
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13		# t recalculate, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
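+# The low lane of each pair needs sin, the high lane needs cos.  The sin lane
+# is evaluated as  sin(x) ~= x + (x^3*zs - 0.5*x^2*xx + xx)  with zs the
+# packed .Lsincosarray polynomial; the cos lane reuses the t/r scheme of the
+# coscos kernel, and movlhps merges the two scalar results back into one
+# register at the end.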
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6					# move high x2 for x3 for sin term
+	movhlps	%xmm3,%xmm7					# move high x2 for x3 for sin term
+	mulsd	p_temp2+8(%rsp),%xmm6				# get high x3 for sin term
+	mulsd	p_temp3+8(%rsp),%xmm7				# get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6					# move high x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+	mulsd	p_temp2+8(%rsp),%xmm6				# get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
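+
+# Mixed case: the first pair (xmm0/xmm2) is evaluated with the packed sin
+# polynomial (.Lsinarray) and the second pair (xmm1/xmm3) with the packed cos
+# polynomial (.Lcosarray); the two evaluations are interleaved instruction by
+# instruction.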
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13		# t recalculate, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# t recalculate, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4:			#Derived from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:			#Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# recalculate t, -t = r-1
+	subsd	.L__real_3ff0000000000000(%rip),%xmm13		# recalculate t, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm3,%xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+	mulpd	%xmm3,%xmm5					# lower=x3 * zs
+								# upper=x4 * zc
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1,%xmm9 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+	movapd	 %xmm10,p_temp2(%rsp)				# r
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+	movhlps	%xmm0,%xmm9					# upper x for cos term ; note using odd reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+	mulpd	%xmm2,%xmm4					# lower=x3 * zs
+								# upper=x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+								# note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8					# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_cos_cleanup
diff --git a/src/gas/vrdaexp.S b/src/gas/vrdaexp.S
new file mode 100644
index 0000000..1ee640e
--- /dev/null
+++ b/src/gas/vrdaexp.S
@@ -0,0 +1,619 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdaexp.asm
+#
+# An array implementation of the exp libm function.
+#
+# Prototype:
+#
+# void vrda_exp(int n, double *x, double *y);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+#
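+#
+# As an illustrative scalar reference (a C sketch under assumptions, not the
+# two-at-a-time SSE2 code below), the reduction/polynomial scheme is roughly
+# the following; exp_sketch and its local names are hypothetical:
+#
+#   #include <math.h>
+#
+#   /* exp(x) ~= 2^m * 2^(j/32) * (1 + q), where n = nearest int to x*32/ln2,
+#      j = n & 31, m = (n - j)/32.  The asm reads 2^(j/32) from the lead/trail
+#      tables at the end of this file and uses split lead/tail log2/32
+#      constants for the subtraction. */
+#   double exp_sketch(double x)
+#   {
+#       const double ln2 = 0.6931471805599453;
+#       double r = x * (32.0 / ln2);
+#       int n = (int)lrint(r);                 /* nearest integer */
+#       int j = n & 0x1f;
+#       int m = (n - j) / 32;
+#       double r1 = x - n * (ln2 / 32.0);
+#       double q  = r1 + r1*r1*(1.0/2 + r1*(1.0/6 + r1*(1.0/24
+#                      + r1*(1.0/120 + r1*(1.0/720)))));
+#       double f  = pow(2.0, (double)j / 32.0);
+#       return ldexp(f + f*q, m);              /* scale by 2^m */
+#   }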
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for exponent multiply
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p2_temp,0x40 # second temporary for get/put bits operation
+ # large enough for two vectors
+.equ p2_temp1,0x60 # second temporary for exponent multiply
+ # large enough for two vectors
+.equ save_rbx,0x080 #qword
+
+.equ stack_size,0x088
+
+ .weak vrda_exp_
+ .set vrda_exp_,__vrda_exp__
+ .weak vrda_exp__
+ .set vrda_exp__,__vrda_exp__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array exp
+#** VRDA_EXP(N,X,Y)
+# C equivalent*/
+#void vrda_exp__(int * n, double *x, double *y)
+#{
+# vrda_exp(*n,x,y);
+#}
+.globl __vrda_exp__
+ .type __vrda_exp__,@function
+__vrda_exp__:
+ mov (%rdi),%edi
+
+
+ .align 16
+ .p2align 4,,15
+
+
+# parameters are passed in by gcc as:
+# edi - int n
+# rsi - double *x
+# rdx - double *y
+
+
+.globl vrda_exp
+ .type vrda_exp,@function
+vrda_exp:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 4 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+# compute the exponents
+
+# Step 1. Reduce the argument.
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm3,%xmm7
+ movapd %xmm0,p_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+ movlpd -16(%rsi),%xmm6
+ movhpd -8(%rsi),%xmm6
+ movapd %xmm6,p2_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm6
+ mulpd %xmm6,%xmm7
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+ minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+ cvtpd2dq %xmm7,%xmm2
+ cvtdq2pd %xmm2,%xmm8
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+ movq %xmm2,p2_temp1(%rsp)
+ movapd .L__real_log2_by_32_lead(%rip),%xmm9
+ mulpd %xmm8,%xmm9
+ subpd %xmm9,%xmm6 # r1b in xmm6
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ movapd %xmm0,%xmm2
+ addpd %xmm1,%xmm2 # r = r1 + r2
+
+ mov $0x01f,%r11
+ mov %r11,%r10
+ mov p2_temp1(%rsp),%ebx
+ and %ebx,%r11d
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ movapd %xmm6,%xmm9
+ addpd %xmm8,%xmm9 # rb = r1b + r2b
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx		## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ sub %r11d,%ebx
+ movapd %xmm9,%xmm1
+ addpd %xmm3,%xmm0 # q = final sum
+ movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+ mov p2_temp1+4(%rsp),%r8d
+ and %r8d,%r10d
+ sar $5,%ebx #m
+ mulpd %xmm9,%xmm7 # *x
+ mulpd %xmm9,%xmm3 # *x
+ mulpd %xmm9,%xmm1 # x*x
+ sub %r10d,%r8d
+ sar $5,%r8d
+# check for infinity or nan
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ add $1023,%rdx # add bias
+ shufpd $0,%xmm4,%xmm5
+ movapd %xmm1,%xmm4
+
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm0
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm0 #z = z1 + z2 done with 1,2,3,4,5
+ mov $1024,%rax
+ movsx %ebx,%rbx
+ cmp %rax,%rbx
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+
+	cmovg	%rax,%rbx		## if infinite, then set rbx to multiply
+ # by infinity
+ movsx %r8d,%rdx
+ cmp %rax,%rdx
+
+ movmskpd %xmm2,%r8d
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm7 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm3 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm9,%xmm7 # *x
+	cmovg	%rax,%rdx		## if infinite, then set rdx to multiply by infinity
+
+
+ xor %rax,%rax
+ add $1023,%rbx # add bias
+
+ mulpd %xmm1,%xmm3 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm7 # + 1/24
+ addpd %xmm9,%xmm3 # + x
+ mulpd %xmm4,%xmm7 # *x^4
+
+ cmovs %rax,%rbx ## if denormal, then multiply by 0
+ shl $52,%rbx # build 2^n
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result *= 2^n
+ addpd %xmm7,%xmm3 # q = final sum
+
+ movlpd (%rsi,%r11,8),%xmm5 # f2
+ movlpd (%rsi,%r10,8),%xmm4 # f2
+ addsd (%rdi,%r10,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r11,8),%xmm5 # f1 + f2
+
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shufpd $0,%xmm4,%xmm5
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm3
+ mov %rbx,p2_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p2_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm3 #z = z1 + z2
+
+ movapd p2_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ movmskpd %xmm2,%ebx
+ test $3,%r8d
+ mulpd p2_temp1(%rsp),%xmm3 # result *= 2^n
+# we'd like to avoid a branch, and could use cmp's and and's to
+# eliminate it. But that adds cycles to the normal cases just to
+# handle what should be rare exceptions. Using this branch with the
+# check above results in faster code for the normal cases.
+ jnz .L__exp_naninf
+
+.L__vda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ test $3,%ebx
+ jnz .L__exp_naninf2
+
+.L__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm3,-16(%rdi)
+ movhpd %xmm3,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+#
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_temp(%rsp),%rcx
+ call .L__naninf
+ jmp .L__vda_bottom1
+.L__exp_naninf2:
+ lea p2_temp(%rsp),%rcx
+ mov %ebx,%r8d
+ movapd %xmm3,%xmm0
+ call .L__naninf
+ movapd %xmm0,%xmm3
+ jmp .L__vda_bottom2
+
+# This subroutine checks a double pair for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# r8d - mask of errors
+# xmm0 - computed result vector
+# rcx - pointing to memory image of inputs
+# Outputs:
+# xmm0 - new result vector
+#	%rax, %rdx, %xmm2 all modified.
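+#
+# Sketched in C, the per-element policy applied here is roughly the following
+# (exp_fixup_sketch is a hypothetical helper; the caller has already flagged
+# the element as inf or NaN):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static double exp_fixup_sketch(double x)       /* x is +-inf or a NaN */
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       if (bits & 0x000FFFFFFFFFFFFFULL) {        /* mantissa non-zero: NaN */
+#           bits |= 0x0008000000000000ULL;         /* convert to quiet NaN */
+#           memcpy(&x, &bits, sizeof x);
+#           return x;
+#       }
+#       return (bits >> 63) ? 0.0 : x;             /* exp(-inf)=0, exp(+inf)=inf */
+#   }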
+.L__naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov (%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__r3
+ mov 8(%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+.L__r3:
+ ret
+
+ .align 16
+# we jump here when we have an odd number of exp calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p2_temp+8(%rsp)
+ movapd %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L_vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p2_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L_vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p2_temp+16(%rsp)
+
+.L_vdacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+ call vrda_exp@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L_vdacgf
+
+ mov p2_temp1+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L_vdacgf
+
+ mov p2_temp1+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L_vdacgf:
+ jmp .L__final_check
+
+ .data
+ .align 64
+
+
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000 # for alignment
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000 # for alignment
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog.S b/src/gas/vrdalog.S
new file mode 100644
index 0000000..cdbba18
--- /dev/null
+++ b/src/gas/vrdalog.S
@@ -0,0 +1,954 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog.s
+#
+# An array implementation of the log libm function.
+#
+# Prototype:
+#
+# void vrda_log(int n, double *x, double *y);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute logs in 44
+# cycles with n <= 24
+#
+#
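+#
+# As an illustrative scalar reference (a C sketch under assumptions, not the
+# two-at-a-time SSE2 code below), the table-driven scheme for a positive
+# normal x is roughly the following; log_sketch and its locals are
+# hypothetical names:
+#
+#   #include <math.h>
+#
+#   /* x = f * 2^xexp with f in [0.5,1); f1 = index/128 is a nearby table
+#      point and log(f/f1) is an odd series in u = (f-f1)/(f1 + 0.5*(f-f1)).
+#      The asm takes log(f1) from its lead/tail tables (with slightly
+#      different exponent/table bookkeeping) and splits log(2) likewise. */
+#   double log_sketch(double x)
+#   {
+#       const double ln2 = 0.6931471805599453;
+#       int xexp;
+#       double f  = frexp(x, &xexp);            /* f in [0.5, 1) */
+#       int index = (int)(f * 128.0 + 0.5);     /* asm builds this from mantissa bits */
+#       double f1 = index / 128.0;
+#       double f2 = f - f1;
+#       double u  = f2 / (f1 + 0.5 * f2);
+#       double v  = u * u;
+#       double poly = u + u*v*(1.0/12 + v*(1.0/80 + v*(1.0/448)));   /* ~ cb1..cb3 */
+#       return xexp * ln2 + log(f1) + poly;
+#   }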
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+ .weak vrda_log_
+ .set vrda_log_,__vrda_log__
+ .weak vrda_log__
+ .set vrda_log__,__vrda_log__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#** VRDA_LOG(N,X,Y)
+# C equivalent*/
+#void vrda_log__(int * n, double *x, double *y)
+#{
+# vrda_log(*n,x,y);
+#}
+.globl __vrda_log__
+ .type __vrda_log__,@function
+__vrda_log__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_log
+ .type vrda_log,@function
+vrda_log:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
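+#
+# Sketched in C for a positive normal input (split_sketch is a hypothetical
+# helper; it mirrors the psrlq/psubq/por sequence used below):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static void split_sketch(double x, double *xexp, double *f)
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       *xexp = (double)((int64_t)(bits >> 52) - 1023);    /* psrlq $52 ; psubq 1023 */
+#       uint64_t fbits = (bits & 0x000FFFFFFFFFFFFFULL)    /* .L__real_mant */
+#                      | 0x3FE0000000000000ULL;            /* exponent of .L__real_half */
+#       memcpy(f, &fbits, sizeof *f);                      /* f in [0.5, 1) */
+#   }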
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ ret
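+
+# The near-one computation above, consolidated into scalar C (a sketch; the
+# ca1..ca4 values are the .L__real_ca* constants from the data section):
+#
+#   static double log_near_one_sketch(double x)    /* |x - 1| below threshold */
+#   {
+#       const double ca1 = 8.33333333333317923934e-02;
+#       const double ca2 = 1.25000000037717509602e-02;
+#       const double ca3 = 2.23213998791944806202e-03;
+#       const double ca4 = 4.34887777707614552256e-04;
+#       double r = x - 1.0;
+#       double u = r / (2.0 + r);
+#       double correction = r * u;
+#       u = u + u;
+#       double v = u * u;
+#       double r2 = u * v * (ca1 + v * (ca2 + v * (ca3 + v * ca4))) - correction;
+#       return r + r2;
+#   }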
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz	.L__zn_x		# non-zero after dropping the sign bit, so x is a negative number
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
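+# Combined special-case policy of .L__lni/.L__zni above, sketched in C
+# (log_special_sketch is a hypothetical helper; the callers have already
+# flagged the element as inf/NaN or as zero/negative):
+#
+#   #include <math.h>
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static double log_special_sketch(double x)
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       if ((bits & 0x7FF0000000000000ULL) == 0x7FF0000000000000ULL) {
+#           if (bits & 0x000FFFFFFFFFFFFFULL) {    /* NaN: convert to quiet */
+#               bits |= 0x0008000000000000ULL;
+#               memcpy(&x, &bits, sizeof x);
+#               return x;
+#           }
+#           return (bits >> 63) ? NAN : x;         /* log(-inf)=NaN, log(+inf)=inf */
+#       }
+#       if ((bits << 1) == 0)
+#           return -INFINITY;                      /* log(+-0) = -inf (C99) */
+#       return NAN;                                /* negative x -> NaN */
+#   }
+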
+
+# we jump here when we have an odd number of log calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog10.S b/src/gas/vrdalog10.S
new file mode 100644
index 0000000..f766b62
--- /dev/null
+++ b/src/gas/vrdalog10.S
@@ -0,0 +1,1021 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog10.s
+#
+# An array implementation of the log10 libm function.
+#
+# Prototype:
+#
+# void vrda_log10(int n, double *x, double *y);
+#
+# Computes the base-10 logarithm (log10) of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute log10s in 50-55
+# cycles with n <= 24
+#
+#
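+# A minimal C-level usage sketch (illustrative only; the array and length
+# values below are hypothetical, not part of this file):
+#
+#   double x[8], y[8];
+#   /* ... fill x[] with positive finite values ... */
+#   vrda_log10(8, x, y);   /* y[i] = log10(x[i]) for i = 0..7 */
+#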
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_log10_
+ .set vrda_log10_,__vrda_log10__
+ .weak vrda_log10__
+ .set vrda_log10__,__vrda_log10__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log10
+#** VRDA_LOG10(N,X,Y)
+# C equivalent*/
+#void vrda_log10__(int * n, double *x, double *y)
+#{
+# vrda_log10(*n,x,y);
+#}
+.globl __vrda_log10__
+ .type __vrda_log10__,@function
+__vrda_log10__:
+ mov (%rdi),%edi
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl vrda_log10
+ .type vrda_log10,@function
+vrda_log10:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the log10s
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
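+# rough C equivalent of the exponent extraction above (a sketch, valid for
+# positive normal x; zero, NaN, and infinity inputs are handled later, and
+# bits(x) denotes the raw IEEE-754 encoding of x):
+#   xexp = (double)((int)(bits(x) >> 52) - 1023);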
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
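+# in C terms (sketch): with m7 = the top 7 bits of the mantissa,
+#   index = 64 + (m7 >> 1) + (m7 & 1);   /* 64 <= index <= 128 */
+# so f1 = index/128 approximates f to within 1/256 and selects the table
+# entry used below.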
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log10 tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
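+# in effect (sketch): poly = u + cb1*u^3 + u^5*(cb2 + cb3*u^2), the leading
+# terms of the series for 2*atanh(u/2) = ln(f/f1), since u = f2/(f1 + f2/2)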
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2 #for log10
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ movapd %xmm1,%xmm7 #for log10
+ mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10
+ addpd %xmm1,%xmm0 #for log10
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ mulpd .L__real_log10e_lead(%rip),%xmm7 #log10
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ addpd %xmm7,%xmm0 #for log10
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0 #for log10
+# addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ movapd %xmm7,%xmm6 #for log10
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+ movapd %xmm9,%xmm8 #for log10
+	mulpd	.L__real_log10e_tail(%rip),%xmm9	#for log10
+ addpd %xmm9,%xmm7 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10
+ addpd %xmm8,%xmm7 #for log10
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm6,%xmm7 #for log10
+# addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
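+# in effect (sketch): the natural-log result r + r2 is scaled by
+# log10(e) = log10e_lead + log10e_tail, with r split into a high part r1
+# (upper 32 bits) and the remainder folded into r2; the four partial
+# products are summed smallest-first to preserve the extra precision.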
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
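+# rough C equivalent (sketch; mantissa_bits and quiet_nan_of are
+# illustrative helper names, not routines defined here):
+#   if (mantissa_bits(x) != 0) return quiet_nan_of(x); /* NaN in, quiet NaN out */
+#   if (!signbit(x))           return x;               /* log10(+inf) = +inf    */
+#   return NaN;                                        /* log10(-inf) is invalid */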
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
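+# rough C equivalent (sketch):
+#   if (x == 0.0) return -INFINITY;   /* C99: log10(+-0) = -inf  */
+#   return NaN;                       /* x < 0 is a domain error */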
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# nonzero after shifting out the sign bit => x is negative, not +-0
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+# we jump here when there are leftover values (fewer than a full group of
+# four) to process at the end.  save_xa points at the next x array element,
+# save_ya at the next y array element, and the number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log10@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog2.S b/src/gas/vrdalog2.S
new file mode 100644
index 0000000..0200f03
--- /dev/null
+++ b/src/gas/vrdalog2.S
@@ -0,0 +1,1003 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog2.s
+#
+# An array implementation of the log2 libm function.
+#
+# Prototype:
+#
+# void vrda_log2(int n, double *x, double *y);
+#
+# Computes the base-2 logarithm (log2) of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute log2s in 44
+# cycles with n <= 24
+#
+#
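+# Usage mirrors vrda_log10 (illustrative sketch):
+#   vrda_log2(n, x, y);   /* y[i] = log2(x[i]) for i = 0..n-1 */
+#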
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+ .weak vrda_log2_
+ .set vrda_log2_,__vrda_log2__
+ .weak vrda_log2__
+ .set vrda_log2__,__vrda_log2__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log2
+#** VRDA_LOG2(N,X,Y)
+# C equivalent*/
+#void vrda_log2__(int * n, double *x, double *y)
+#{
+# vrda_log2(*n,x,y);
+#}
+.globl __vrda_log2__
+ .type __vrda_log2__,@function
+__vrda_log2__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_log2
+ .type vrda_log2,@function
+vrda_log2:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp(%rsp),%xmm6 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm2,%xmm1 #z2
+ movapd %xmm1,%xmm2 #z2 copy
+
+
+ mulpd %xmm4,%xmm5
+ mulpd %xmm4,%xmm1
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
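+# in effect (sketch): z1 (lead table value) and z2 (tail table value plus
+# the polynomial) sum to the natural log of the reduced significand, and
+#   log2(x) = xexp + (z1 + z2)*(log2e_lead + log2e_tail)
+# is assembled as r1 = xexp + z1*log2e_lead plus r2 = the remaining three
+# partial products, so the final sum r1 + r2 keeps the extra precision.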
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ movapd %xmm7,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp2(%rsp),%xmm6 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9 #z2
+ movapd %xmm9,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm9 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm7 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail
+
+
+ addpd %xmm9,%xmm7 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm5,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log2e_tail(%rip),%xmm2
+ mulpd .L__real_log2e_tail(%rip),%xmm0
+ mulpd .L__real_log2e_lead(%rip),%xmm1
+ mulpd .L__real_log2e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log2e_tail(%rip),%xmm2
+ mulsd .L__real_log2e_tail(%rip),%xmm0
+ mulsd .L__real_log2e_lead(%rip),%xmm1
+ mulsd .L__real_log2e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# nonzero after shifting out the sign bit => x is negative, not +-0
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+# we jump here when there are leftover values (fewer than a full group of
+# four) to process at the end.  save_xa points at the next x array element,
+# save_ya at the next y array element, and the number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log2@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail : .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalogr.S b/src/gas/vrdalogr.S
new file mode 100644
index 0000000..4064fb3
--- /dev/null
+++ b/src/gas/vrdalogr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalogr.asm
+#
+# An array implementation of the log libm function.
+#
+# Prototype:
+#
+# void vrda_logr(int n, double *x, double *y);
+#
+# Computes the natural log of x.
+# A reduced-precision routine. Uses the novel Intel reduction technique
+# with frcpa. Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant.
+# This version can compute logs in 26
+# cycles with n <= 24
+#
+#
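+# An illustrative scalar C equivalent of what this routine computes per
+# element (a sketch only; the code below is vectorized, reduced precision,
+# and skips all special-case handling):
+#
+#   #include <math.h>
+#   void vrda_logr(int n, double *x, double *y)
+#   {
+#       int i;
+#       for (i = 0; i < n; i++)
+#           y[i] = log(x[i]);   /* natural log; x[i] assumed finite and > 0 */
+#   }
+#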
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+
+.equ p_x2,0x030 # temporary for error checking operation
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+.equ save_rdi,0x088 #qword
+
+.equ save_rsi,0x090 #qword
+
+
+
+.equ p2_temp,0x0e0 # second temporary for get/put bits operation
+.equ p2_temp1,0x0f0 # second temporary for exponent multiply
+
+
+
+.equ stack_size,0x0118
+
+ .weak vrda_logr_
+ .set vrda_logr_,__vrda_logr__
+ .weak vrda_logr__
+ .set vrda_logr__,__vrda_logr__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#** VRDA_LOGR(N,X,Y)
+#** C equivalent
+#*/
+#void vrda_logr_(int * n, double *x, double *y)
+#{
+# vrda_logr(*n,x,y);
+#}
+.globl __vrda_logr__
+ .type __vrda_logr__,@function
+__vrda_logr__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_logr
+ .type vrda_logr,@function
+vrda_logr:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
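+# (equivalently, in C: iter = n >> 2; leftover = n - (iter << 2), i.e. n % 4;
+#  the main loop below handles groups of 4, the tail handles the leftovers)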
+
+# In this second version, process the array 4 values at a time (two packed
+# doubles in each of two xmm registers).
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
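+# (frcpa(x) here is an approximation to 1/x, supplied by __vrd4_frcpa.
+#  Since frcpa(x) ~= 1/x, r = x*frcpa(x) - 1 is small, and
+#  ln(x) = ln(1+r) + ln(1/frcpa(x)); ln(1+r) is approximated by the short
+#  polynomial below, while ln(1/frcpa(x)) is reconstructed from N*ln(2)
+#  plus a table lookup at the reconstruction step.)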
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# invert the exponent
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial
+# p(r) = p1r^2+p2r^3+p3r^4+p4r^5
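+# (only the first two terms are evaluated here, using the coefficients
+#  -1/2 and +1/3 from .L__real_half and .L__real_third; the remaining
+#  terms are dropped, as noted in the comment below)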
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+# eliminating the 4th and 5th terms gets us to 8000 ulps, or 53-16 = 37 significant bits
+# The routine runs in 60 cycles.
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2)+ln(1/frcpa(x)) via tab of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255
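+# (so the final result per element is y = N*ln(2) + f(k) + (r + p(r)),
+#  with f(k) read from .L__np_lnf_table at the index computed above)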
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# we jump here when we have 1 to 3 leftover log calls to make at the
+# end.
+# The next x and y array elements are found via save_xa and save_ya,
+# and the number of values left is in
+# save_nv
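+# (in effect: copy the 1-3 leftover inputs into a zero-padded 4-element
+#  buffer at p_x, call vrda_logr once more for a block of 4, then copy
+#  back only the valid results)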
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rcx # parameter for N
+ lea p_x(%rsp),%rdx # &x parameter
+ lea p2_temp(%rsp),%r8 # &y parameter
+ call vrda_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+.L__real_fifth: .quad 0x03fc999999999999a # 1/5
+ .quad 0x03fc999999999999a
+.L__real_sixth: .quad 0x0bfc5555555555555 # -1/6
+ .quad 0x0bfc5555555555555
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # ln(2) = 0.6931471805599453
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
+
+
+
diff --git a/src/gas/vrdasin.S b/src/gas/vrdasin.S
new file mode 100644
index 0000000..a5fb8d4
--- /dev/null
+++ b/src/gas/vrdasin.S
@@ -0,0 +1,3073 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdasin.s
+#
+# An array implementation of the sin libm function.
+#
+# Prototype:
+#
+# void vrda_sin(int n, double *x, double *y);
+#
+#Computes Sine of x for an array of input values.
+#Places the results into the supplied y array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
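+#
+#For orientation, a minimal C sketch of the contract (illustrative only; the
+#routine below works on two inputs per xmm register and routes large and
+#special arguments through separate paths):
+#
+#  /* assumes <math.h>; vrda_sin_reference is an illustrative name */
+#  void vrda_sin_reference(int n, double *x, double *y)
+#  {
+#      for (int i = 0; i < n; i++)
+#          y[i] = sin(x[i]);   /* no error checking, as noted above */
+#  }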
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
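+# The coefficient tables above (.Lcosarray/.Lsinarray and their interleaved
+# variants) feed the core polynomial approximations once the argument has
+# been reduced to |r| <= pi/4.  Roughly, in illustrative C (the code below
+# interleaves and schedules these steps differently):
+#
+#   double r2 = r * r;
+#   double sin_r = r + r*r2*(s1 + r2*(s2 + r2*(s3 + r2*(s4 + r2*(s5 + r2*s6)))));
+#   double cos_r = 1.0 - 0.5*r2
+#                  + r2*r2*(c1 + r2*(c2 + r2*(c3 + r2*(c4 + r2*(c5 + r2*c6)))));
+#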
+.Levensin_oddcos_tbl:
+ .quad .Lsinsin_sinsin_piby4 # 0
+ .quad .Lsinsin_sincos_piby4 # 1
+ .quad .Lsinsin_cossin_piby4 # 2
+ .quad .Lsinsin_coscos_piby4 # 3
+
+ .quad .Lsincos_sinsin_piby4 # 4
+ .quad .Lsincos_sincos_piby4 # 5
+ .quad .Lsincos_cossin_piby4 # 6
+ .quad .Lsincos_coscos_piby4 # 7
+
+ .quad .Lcossin_sinsin_piby4 # 8
+ .quad .Lcossin_sincos_piby4 # 9
+ .quad .Lcossin_cossin_piby4 # 10
+ .quad .Lcossin_coscos_piby4 # 11
+
+ .quad .Lcoscos_sinsin_piby4 # 12
+ .quad .Lcoscos_sincos_piby4 # 13
+ .quad .Lcoscos_cossin_piby4 # 14
+ .quad .Lcoscos_coscos_piby4 # 15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_sin_
+ .set vrda_sin_,__vrda_sin__
+ .weak vrda_sin__
+ .set vrda_sin__,__vrda_sin__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sin
+#** VRDA_SIN(N,X,Y)
+# C equivalent*/
+#void vrda_sin__(int * n, double *x, double *y)
+#{
+# vrda_sin(*n,x,y);
+#}
+.globl __vrda_sin__
+ .type __vrda_sin__,@function
+__vrda_sin__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x20 # save area for xmm6
+.equ save_xmm7, 0x30 # save area for xmm7
+.equ save_xmm8, 0x40 # save area for xmm8
+.equ save_xmm9, 0x50 # save area for xmm9
+.equ save_xmm10, 0x60 # save area for xmm10
+.equ save_xmm11, 0x70 # save area for xmm11
+.equ save_xmm12, 0x80 # save area for xmm12
+.equ save_xmm13, 0x90 # save area for xmm13
+.equ save_xmm14, 0x0A0 # save area for xmm14
+.equ save_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # storage for r passed to remainder_piby2
+.equ rr, 0x0D0 # storage for rr passed to remainder_piby2
+.equ region, 0x0E0 # storage for region passed to remainder_piby2
+
+.equ r1, 0x0F0 # storage for r (second pair)
+.equ rr1, 0x0100 # storage for rr (second pair)
+.equ region1, 0x0110 # storage for region (second pair)
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_r12, 0x01C0 # save area for r12
+.equ save_r13, 0x01D0 # save area for r13
+
+.equ save_xa, 0x01E0 #qword
+.equ save_ya, 0x01F0 #qword
+
+.equ save_nv, 0x0200 #qword
+.equ p_iter, 0x0210 # qword storage for number of loop iterations
+
+
+.globl vrda_sin
+ .type vrda_sin,@function
+vrda_sin:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in on Linux (System V AMD64 ABI) as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ sub $0x228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+
+# save the arguments
+ mov %rsi, save_xa(%rsp) # save x_array pointer
+ mov %rdx, save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrda_cleanup # jump if only single calls
+
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+
+# build the input _m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ mov (%rsi),%rax
+ mov 8(%rsi),%rcx
+ movdqa %xmm0,%xmm6
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ mov -16(%rsi), %r8
+ mov -8(%rsi), %r9
+ movdqa %xmm1,%xmm7
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+and .L__real_7fffffffffffffff(%rip), %rax
+and .L__real_7fffffffffffffff(%rip), %rcx
+and .L__real_7fffffffffffffff(%rip), %r8
+and .L__real_7fffffffffffffff(%rip), %r9
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
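+#
+# In scalar terms the reduction carried out above (and completed just below
+# when r and rr are formed) is, illustratively:
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;          /* reduced argument          */
+#   rr    = (rhead - r) - rtail;    /* low-order correction to r */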
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
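+#
+# Per lane the final sign is sign(x) XOR bit 1 of npi2: with x = npi2*(pi/2)+r
+# the result is +/-sin(r) or +/-cos(r), the minus sign appearing exactly when
+# bit 1 of npi2 is set, and sin(-x) = -sin(x) folds the input sign back in.
+# Illustrative scalar C for one lane (quad/negate are illustrative names):
+#
+#   int quad   = npi2 & 3;                      /* selects sin vs cos of r */
+#   int negate = ((npi2 >> 1) ^ (x < 0.0)) & 1;
+#   /* the masks written to p_sign/p_sign1 above carry this bit in the
+#      floating-point sign position so it can be applied at the end */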
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
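+#
+# The 4-bit dispatch index packs the low quadrant bit of each of the four
+# lanes (0 => compute sin(r), 1 => compute cos(r) for that lane).  Roughly,
+# in illustrative C, with quad0..quad3 the per-lane npi2 values:
+#
+#   idx =  (quad0 & 1)         /* xmm0 low lane  */
+#       | ((quad1 & 1) << 1)   /* xmm0 high lane */
+#       | ((quad2 & 1) << 2)   /* xmm1 low lane  */
+#       | ((quad3 & 1) << 3);  /* xmm1 high lane */
+#   /* then jump through .Levensin_oddcos_tbl[idx] */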
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm10, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
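+# From the register setup above, the helper appears to take the argument in
+# xmm0 and pointers to the r, rr and 32-bit region slots in rdi/rsi/rdx,
+# i.e. roughly:
+#   void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+# (an inference from this call site)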
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+#	mov	p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm1, %xmm3, %xmm7
+# Restore %xmm4 and %xmm1, %xmm3, %xmm7
+# Can use %xmm8, %xmm10, %xmm12
+#         %xmm5, %xmm9, %xmm11, %xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm0,%xmm0	;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10				# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	 %xmm0,r(%rsp)				# store lower r
+	movlpd	 %xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+	mov	$0x411E848000000000,%r10		# 5e5 in double-precision bits
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm1, %xmm3, %xmm5 = x, %xmm4 = 0.5
+
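+# (Both values of this pair are below the 5e5 cutoff, so neither can be
+#  NaN/Inf and the reduction below can use packed (pd) instructions on both
+#  lanes at once, unlike the scalar paths above.)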
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm1, %xmm3, %xmm7
+# Can use %xmm9, %xmm11, %xmm13
+#         %xmm5, %xmm8, %xmm10, %xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+	mov	$0x411E848000000000,%r10		# 5e5 in double-precision bits
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+# mov p_original1(%rsp),%r8 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+# mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rax, %rcx, %r8, %r9
+#%xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm1,%xmm1	;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+	subsd	%xmm0,%xmm1				# xmm1 = r = (rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+	subsd	%xmm0,%xmm7				# xmm7 = rr = ((rhead-r) - rtail)
+
+	movlpd	 %xmm1,r1(%rsp)				# store lower r
+	movlpd	 %xmm7,rr1(%rsp)			# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
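+# (Commentary: the block above appears to form, for each lane, the final sign
+#  as A XOR B -- the ~AB + A~B expression is just an XOR -- where A comes from
+#  r12/r13 (presumably the saved input sign bits) and B is bit 1 of the lane's
+#  region.  Each resulting bit is shifted into bit 63 of its lane of
+#  p_sign/p_sign1 so the cleanup code can apply it with a single xorpd.)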
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
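+# (Commentary: the index in %rax appears to pack the low (even/odd) bit of all
+#  four regions into a 4-bit value -- bits 0-1 from the first pair, bits 2-3
+#  from the second -- selecting one of the sin/cos combination routines below
+#  via .Levensin_oddcos_tbl.)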
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ movapd p_sign(%rsp), %xmm0
+ movapd p_sign1(%rsp), %xmm1
+ xorpd %xmm4, %xmm0 # (+) Sign
+ xorpd %xmm5, %xmm1 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+.L__vrda_bottom2:
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1, -16(%rdi)
+ movhpd %xmm1, -8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x228,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have 1-3 leftover sin values to compute at the end
+# The number of values left is in save_nv
+
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill a four-double buffer with zeroes and the extra values, then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
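+
+# (The 1-3 leftover inputs are padded with zeroes to a full group of four and
+#  handled by the recursive call below; only the valid results are copied back
+#  to the caller's array afterwards.)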
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrda_sin@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_temp2(%rsp),%rcx
+ mov %rcx, (%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx, 8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx, 16(%rdi) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+
+
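+# (Both pairs land in a "cos" region here.  The code below appears to follow
+#  the usual core evaluation: with r = 0.5*x^2 and t = 1 - r,
+#	cos(x + xx) ~ t + (((1 - t) - r) - x*xx) + x^4*zc,
+#  where zc is the even polynomial in x^2 built from .Lcosarray.)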
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
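+# (Mixed case: for each input pair the low lane needs sin and the high lane
+#  needs cos.  .Lsincosarray appears to hold sin coefficients in its low
+#  halves and cos coefficients in its high halves, and the two lanes are
+#  split apart with movhlps once the shared polynomial work is done.)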
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6				# move high x2 for x3 for sin term
+	movhlps	%xmm3,%xmm7				# move high x2 for x3 for sin term
+	mulsd	p_temp2+8(%rsp),%xmm6			# get high x3 for sin term
+	mulsd	p_temp3+8(%rsp),%xmm7			# get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6				# move high x2 for x3 for sin term
+	movsd	%xmm3,%xmm7				# move low x2 for x3 for sin term +
+	mulsd	p_temp2+8(%rsp),%xmm6			# get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
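+# (Here the first pair takes the sin path (zs from .Lsinarray) and the second
+#  pair the cos path (zc from .Lcosarray).  The evaluation below appears to be
+#	sin(x + xx) ~ x + x^3*zs - 0.5*x^2*xx + xx
+#	cos(x + xx) ~ t + (((1 - t) - r) - x*xx) + x^4*zc,  r = 0.5*x^2, t = 1 - r.)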
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+	subsd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0, %xmm2 # x3
+ mulpd %xmm3, %xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+	mulpd	%xmm3,%xmm5	# lower=x3 * zs
+				# upper=x4 * zc
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1, %xmm9 # +x
+
+ movlhps %xmm9, %xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+	mulpd	%xmm2,%xmm4	# lower=x3 * zs
+				# upper=x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8		# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# xorpd %xmm0, %xmm0
+# xorpd %xmm1, %xmm1
+# jmp .Lfinal_check
+#DEBUG
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_sin_cleanup
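+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# Illustrative scalar C sketch of the sin kernel evaluated above for each
+# lane (x = reduced argument r, xx = its tail rr, both in [-pi/4, pi/4]).
+# The coefficient values are the approximations commented in .Lsinarray;
+# the helper name is only for exposition:
+#
+#   static double sin_piby4_sketch(double x, double xx)
+#   {
+#       const double s1 = -0.166667,      s2 = 0.00833333,
+#                    s3 = -0.000198413,   s4 = 2.75573e-006,
+#                    s5 = -2.50511e-008,  s6 = 1.59181e-010;
+#       double x2 = x*x, x3 = x2*x, x6 = x3*x3;
+#       double zs = (s1 + x2*(s2 + x2*s3)) + x6*(s4 + x2*(s5 + x2*s6));
+#       return ((x3*zs - 0.5*x2*xx) + xx) + x;
+#   }
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;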
diff --git a/src/gas/vrdasincos.S b/src/gas/vrdasincos.S
new file mode 100644
index 0000000..d31e98a
--- /dev/null
+++ b/src/gas/vrdasincos.S
@@ -0,0 +1,1710 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdasincos.s
+#
+# An array implementation of the sincos libm function.
+#
+# Prototype:
+#
+# void vrda_sincos(int n, double *x, double *ys, double *yc);
+#
+#Computes Sine of x for an array of input values.
+#Places the results into the supplied ys array.
+#Computes Cosine of x for an array of input values.
+#Places the results into the supplied yc array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
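+#
+# A minimal C usage sketch of the prototype above (array values are
+# illustrative only):
+#
+#   #include <stdio.h>
+#   extern void vrda_sincos(int n, double *x, double *ys, double *yc);
+#
+#   int main(void)
+#   {
+#       double x[4] = {0.1, 0.5, 1.0, 2.0}, s[4], c[4];
+#       vrda_sincos(4, x, s, c);     /* s[i] = sin(x[i]), c[i] = cos(x[i]) */
+#       for (int i = 0; i < 4; i++)
+#           printf("%g %g %g\n", x[i], s[i], c[i]);
+#       return 0;
+#   }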
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_jt_mask: .quad 0x0000000000000000F #
+ .quad 0x00000000000000000 #
+.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff #
+ .quad 0x000000000ffffffff #
+.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 #
+ .quad 0x0ffffffff00000000 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+
+
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
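+
+# The two mixed tables interleave the series: .Lsincosarray holds a sin
+# coefficient in the low qword and the matching cos coefficient in the high
+# qword, .Lcossinarray the reverse, so a single packed multiply/add advances
+# both series when one lane needs sin and the other needs cos.  Illustrative
+# C view of one table entry (the type name is only for exposition):
+#
+#   typedef struct { double lo, hi; } qword_pair;
+#   /* .Lsincosarray[k] = { s[k+1], c[k+1] }   sin in lane 0, cos in lane 1 */
+#   /* .Lcossinarray[k] = { c[k+1], s[k+1] }   cos in lane 0, sin in lane 1 */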
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_sincos_
+ .set vrda_sincos_,__vrda_sincos__
+ .weak vrda_sincos__
+ .set vrda_sincos__,__vrda_sincos__
+
+.text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sincos
+#** VRDA_SINCOS(N,X,YS,YC)
+# C equivalent*/
+#void vrda_sincos__( int * n, double *x, double *ys, double *yc)
+#{
+#	vrda_sincos(*n,x,ys,yc);
+#}
+.globl __vrda_sincos__
+ .type __vrda_sincos__,@function
+__vrda_sincos__:
+ mov (%rdi),%edi
+.align 16
+.p2align 4,,15
+
+# define local variable storage offsets
+.equ save_xmm6, 0x00 # temporary for get/put bits operation
+.equ save_xmm7, 0x10 # temporary for get/put bits operation
+.equ save_xmm8, 0x20 # temporary for get/put bits operation
+.equ save_xmm9, 0x30 # temporary for get/put bits operation
+.equ save_xmm10, 0x40 # temporary for get/put bits operation
+.equ save_xmm11, 0x50 # temporary for get/put bits operation
+.equ save_xmm12, 0x60 # temporary for get/put bits operation
+.equ save_xmm13, 0x70 # temporary for get/put bits operation
+.equ save_xmm14, 0x80 # temporary for get/put bits operation
+.equ save_xmm15, 0x90 # temporary for get/put bits operation
+
+.equ save_rdi, 0x0A0
+.equ save_rsi, 0x0B0
+.equ save_rbx, 0x0C0
+
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ	rr, 0x0E0	# pointer to rr for remainder_piby2
+.equ rsq, 0x0F0
+.equ	region, 0x0100	# pointer to region for remainder_piby2
+
+.equ r1, 0x0110 # pointer to r for remainder_piby2
+.equ	rr1, 0x0120	# pointer to rr1 for remainder_piby2
+.equ rsq1, 0x0130
+.equ	region1, 0x0140	# pointer to region1 for remainder_piby2
+
+.equ p_temp, 0x0150 # temporary for get/put bits operation
+.equ p_temp1, 0x0160 # temporary for get/put bits operation
+
+.equ p_temp2, 0x0170 # temporary for get/put bits operation
+.equ p_temp3, 0x0180 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0190 # temporary for get/put bits operation
+.equ p_temp5, 0x01A0 # temporary for get/put bits operation
+
+.equ p_temp6, 0x01B0 # temporary for get/put bits operation
+.equ p_temp7, 0x01C0 # temporary for get/put bits operation
+
+.equ p_original, 0x01D0 # original x
+.equ p_mask, 0x01E0 # original x
+.equ	p_signs, 0x01F0		# sign words for sin
+.equ	p_signc, 0x0200		# sign words for cos
+.equ p_region, 0x0210
+
+.equ p_original1, 0x0220 # original x
+.equ p_mask1, 0x0230 # original x
+.equ	p_signs1, 0x0240	# sign words for sin
+.equ	p_signc1, 0x0250	# sign words for cos
+.equ p_region1, 0x0260
+
+.equ save_r12, 0x0270 # temporary for get/put bits operation
+.equ save_r13, 0x0280 # temporary for get/put bits operation
+
+.equ save_r14, 0x0290 # temporary for get/put bits operation
+.equ save_r15, 0x02A0 # temporary for get/put bits operation
+
+.equ save_xa, 0x02B0 # qword ; leave space for 4 args*****
+.equ save_ysa, 0x02C0 # qword ; leave space for 4 args*****
+.equ save_yca, 0x02D0 # qword ; leave space for 4 args*****
+
+.equ save_nv, 0x02E0 # qword
+.equ p_iter, 0x02F0 # qword storage for number of loop iterations
+
+
+.globl vrda_sincos
+ .type vrda_sincos,@function
+vrda_sincos:
+
+ sub $0x0308,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ysa(%rsp) # save ysin_array pointer
+ mov %rcx,save_yca(%rsp) # save ycos_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+ # see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrda_cleanup # jump if only single calls
+ # prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
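+
+# The array is processed four doubles per iteration; an equivalent C sketch
+# of the bookkeeping just done (variable names are illustrative):
+#
+#   unsigned long iter = n >> 2;           /* four-wide loop iterations        */
+#   unsigned long left = n - (iter << 2);  /* 0..3 elements for scalar cleanup */
+#   /* iter == 0 jumps straight to the scalar cleanup path */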
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+# build the input _m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+ mov .L__real_7fffffffffffffff(%rip),%rdx
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ mov (%rsi),%rax
+ mov 8(%rsi),%rcx
+ movdqa %xmm0,%xmm6
+ movdqa %xmm0,p_original(%rsp)
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ mov -16(%rsi), %r8
+ mov -8(%rsi), %r9
+ movdqa %xmm1,%xmm7
+ movdqa %xmm1,p_original1(%rsp)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+and %rdx,%rax
+and %rdx,%rcx
+and %rdx,%r8
+and %rdx,%r9
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5, xmm6 =x
+# xmm3 = x, xmm5 =0.5, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+
+ xorpd %xmm12,%xmm12
+
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx # compare value for cossin path
+ mov %r8,%r10 # For Sign of Sin
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
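+
+# Consolidated C sketch of the extra-precision reduction done above for the
+# |x| < 5e5 path; piby2_1, piby2_2 and piby2_2tail are the three-piece
+# representation of pi/2 defined at the top of the file:
+#
+#   double npi2  = (double)(int)(x * twobypi + 0.5);
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   /* r = rhead - rtail and rr = (rhead - r) - rtail are formed below */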
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin
+ pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin
+
+ pcmpeqd %xmm12,%xmm4
+ pcmpeqd %xmm12,%xmm5
+
+ punpckldq %xmm4,%xmm4
+ punpckldq %xmm5,%xmm5
+
+ movapd %xmm4,p_region(%rsp)
+ movapd %xmm5,p_region1(%rsp)
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_signs(%rsp) #write out lower sign bit
+ mov %r12,p_signs+8(%rsp) #write out upper sign bit
+ mov %r11,p_signs1(%rsp) #write out lower sign bit
+ mov %r13,p_signs1+8(%rsp) #write out upper sign bit
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ movapd %xmm0,%xmm2 #move r for r2
+ movapd %xmm1,%xmm3 #move r for r2
+
+ mulpd %xmm0,%xmm2 #r2
+ mulpd %xmm1,%xmm3 #r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ shr $1,%r8
+ shr $1,%r9
+
+ mov %r8,%r12
+ mov %r9,%r13
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r8 #shift lower sign bit left by 63 bits
+ shl $63,%r9 #shift lower sign bit left by 63 bits
+
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r8,p_signc(%rsp) #write out lower sign bit
+ mov %r12,p_signc+8(%rsp) #write out upper sign bit
+ mov %r9,p_signc1(%rsp) #write out lower sign bit
+ mov %r13,p_signc1+8(%rsp) #write out upper sign bit
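+
+# The gpr logic above derives the output signs from the quadrant npi2 and the
+# sign of the input; per element it is equivalent to this C sketch:
+#
+#   int neg_sin = ((npi2 >> 1) & 1) ^ (x < 0.0);  /* sin flips in quadrants 2,3
+#                                                    and with the input sign  */
+#   int neg_cos = ((npi2 + 1) >> 1) & 1;          /* cos flips in quadrants 1,2 */
+#   /* the shl $63 / shl $31 stores turn these bits into IEEE sign-bit masks
+#      written to p_signs / p_signc for use by the cleanup code */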
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsinsin_sinsin_piby4:
+
+ movapd %xmm0,p_temp(%rsp) # copy of x
+ movapd %xmm1,p_temp1(%rsp) # copy of x
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # s3
+
+ movdqa .Lcosarray+0x50(%rip),%xmm12 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm13 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm14 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm15 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # s6*x2
+ mulpd %xmm3,%xmm5 # s6*x2
+ mulpd %xmm2,%xmm8 # s3*x2
+ mulpd %xmm3,%xmm9 # s3*x2
+
+ mulpd %xmm2,%xmm12 # s6*x2
+ mulpd %xmm3,%xmm13 # s6*x2
+ mulpd %xmm2,%xmm14 # s3*x2
+ mulpd %xmm3,%xmm15 # s3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # s2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # s2+x2C3
+
+ addpd .Lcosarray+0x40(%rip),%xmm12 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm13 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm14 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm15 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm13 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm14 # x2(s2+x2C3)
+ mulpd %xmm3,%xmm15 # x2(s2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray(%rip),%xmm8 # s1 + x2(s2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # s1 + x2(s2+x2C3)
+
+ movapd %xmm2,p_temp4(%rsp) # copy of r
+ movapd %xmm3,p_temp5(%rsp) # copy of r
+
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm1 # r
+
+ addpd .Lcosarray+0x30(%rip),%xmm12 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm13 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm14 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm15 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ subpd .L__real_3ff0000000000000(%rip),%xmm1 # -t=r-1.0
+
+ mulpd %xmm10,%xmm4 # x6(s4 + x2(s5+x2s6))
+ mulpd %xmm11,%xmm5 # x6(s4 + x2(s5+x2s6))
+
+ mulpd %xmm10,%xmm12 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm13 # x6(c4 + x2(c5+x2c6))
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm0 # 1+(-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm1 # 1+(-t)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ addpd %xmm14,%xmm12 # zc
+ addpd %xmm15,%xmm13 # zc
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = 0.5 * x2 *xx, xmm4 = zs, xmm12 = zc, xmm6 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = 0.5 * x2 *xx, xmm5 = zs, xmm13 = zc, xmm7 =rr
+
+# Free
+# %xmm8,,%xmm10 xmm14
+# %xmm9,,%xmm11 xmm15
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd p_temp2(%rsp),%xmm10 # x2 for x3
+ movapd p_temp3(%rsp),%xmm11 # x2 for x3
+
+ movapd %xmm10,%xmm8 # x2 for x4
+ movapd %xmm11,%xmm9 # x2 for x4
+
+ movapd p_temp(%rsp),%xmm14 # x for x*xx
+ movapd p_temp1(%rsp),%xmm15 # x for x*xx
+
+ subpd p_temp4(%rsp),%xmm0 # (1 + (-t)) - r
+ subpd p_temp5(%rsp),%xmm1 # (1 + (-t)) - r
+
+ mulpd %xmm14,%xmm10 # x3
+ mulpd %xmm15,%xmm11 # x3
+
+ mulpd %xmm8,%xmm8 # x4
+ mulpd %xmm9,%xmm9 # x4
+
+ mulpd %xmm6,%xmm14 # x*xx
+ mulpd %xmm7,%xmm15 # x*xx
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ mulpd %xmm8,%xmm12 # x4 * zc
+ mulpd %xmm9,%xmm13 # x4 * zc
+
+ subpd %xmm2,%xmm4 # x3*zs-0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # x3*zs-0.5 * x2 *xx
+
+ subpd %xmm14,%xmm0 # ((1 + (-t)) - r) -x*xx
+ subpd %xmm15,%xmm1 # ((1 + (-t)) - r) -x*xx
+
+
+ movapd p_temp4(%rsp),%xmm10 # r for t
+ movapd p_temp5(%rsp),%xmm11 # r for t
+
+ addpd %xmm6,%xmm4 # sin+xx
+ addpd %xmm7,%xmm5 # sin+xx
+
+ addpd %xmm0,%xmm12 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm1,%xmm13 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd p_region(%rsp),%xmm2
+ movapd p_region1(%rsp),%xmm3
+
+ movapd %xmm2,%xmm8
+ movapd %xmm3,%xmm9
+
+ addpd p_temp(%rsp),%xmm4 # sin+xx+x
+ addpd p_temp1(%rsp),%xmm5 # sin+xx+x
+
+	subpd	%xmm10,%xmm12		# cos + t
+	subpd	%xmm11,%xmm13		# cos + t
+
+# xmm4 = sin, xmm5 = sin
+# xmm12 = cos, xmm13 = cos
+
+ andnpd %xmm4,%xmm8
+ andnpd %xmm5,%xmm9
+
+ andpd %xmm2,%xmm4
+ andpd %xmm3,%xmm5
+
+ andnpd %xmm12,%xmm2
+ andnpd %xmm13,%xmm3
+
+ andpd p_region(%rsp),%xmm12
+ andpd p_region1(%rsp),%xmm13
+
+ orpd %xmm2,%xmm4
+ orpd %xmm3,%xmm5
+
+ orpd %xmm8,%xmm12
+ orpd %xmm9,%xmm13
+
+ jmp .L__vrd4_sin_cleanup
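+
+# C sketch of the cos half of the kernel above (the sin half follows the same
+# pattern as the sin-only path); x and xx are the reduced argument and its
+# tail, c1..c6 the .Lcosarray coefficients, and t = 1 - 0.5*x2 is split into a
+# head and a rounding-error correction to keep extra precision:
+#
+#   double x2 = x*x, x4 = x2*x2, x6 = x4*x2;
+#   double zc = (c1 + x2*(c2 + x2*c3)) + x6*(c4 + x2*(c5 + x2*c6));
+#   double r  = 0.5 * x2;
+#   double t  = 1.0 - r;                  /* head of cos                  */
+#   double e  = (1.0 - t) - r;            /* rounding error of the head   */
+#   double cos_x = t + ((e - x*xx) + x4*zc);
+#
+# The andnpd/andpd/orpd sequence then routes the sin or cos result to each
+# lane according to p_region, so the cleanup code stores the right value into
+# the ys and yc arrays.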
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+ jmp .Lcheck_next2_args
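+
+# For a NaN/Inf element the usual convention is followed: the exponent field
+# is tested against 0x7ff0000000000000, the quiet-NaN bit is OR-ed in, and
+# region/rr are zeroed so the polynomial path simply propagates the value.
+# Scalar C sketch of the intent (the union is only for exposition):
+#
+#   union { double d; unsigned long long u; } v = { .d = x };
+#   if ((v.u & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL) {
+#       v.u |= 0x0008000000000000ULL;    /* r = x | quiet-NaN bit */
+#       r = v.d;  rr = 0.0;  region = 0;
+#   }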
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+#	  xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax				# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	 %xmm0,r(%rsp)				# store lower r
+	movlpd	 %xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use 	xmm9, xmm11, xmm13
+# 	xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm2, %xmm4
+# movapd %xmm1, %xmm5
+# movapd %xmm2, %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd %xmm1, %xmm5
+# movapd region(%rsp), %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd %xmm1, %xmm5
+# movapd region(%rsp), %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+	mov	p_original1(%rsp),%r8			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd r(%rsp), %xmm4
+# movapd r1(%rsp), %xmm5
+# movapd r(%rsp), %xmm12
+# movapd r1(%rsp), %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rax,%rcx,%r8,%r9
+#xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd region1(%rsp), %xmm5
+# movapd region(%rsp), %xmm12
+# movapd region1(%rsp), %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ movlpd region(%rsp),%xmm4
+ movlpd region1(%rsp),%xmm5
+
+ pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin
+ pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin
+
+ xorpd %xmm12,%xmm12
+ pcmpeqd %xmm12,%xmm4
+ pcmpeqd %xmm12,%xmm5
+
+ punpckldq %xmm4,%xmm4
+ punpckldq %xmm5,%xmm5
+
+ movapd %xmm4,p_region(%rsp)
+ movapd %xmm5,p_region1(%rsp)
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_signs(%rsp) #write out lower sign bit
+ mov %r12,p_signs+8(%rsp) #write out upper sign bit
+ mov %r11,p_signs1(%rsp) #write out lower sign bit
+ mov %r13,p_signs1+8(%rsp) #write out upper sign bit
+
+ movapd %xmm0,%xmm2 # r
+ movapd %xmm1,%xmm3 # r
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ shr $1,%r8
+ shr $1,%r9
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r8 #shift lower sign bit left by 63 bits
+ shl $63,%r9 #shift lower sign bit left by 63 bits
+
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+
+ mov %r8,p_signc(%rsp) #write out lower sign bit
+ mov %rax,p_signc+8(%rsp) #write out upper sign bit
+ mov %r9,p_signc1(%rsp) #write out lower sign bit
+ mov %rcx,p_signc1+8(%rsp) #write out upper sign bit
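+# Per lane, the sign words written above reduce to the following sketch
+# (A = sign bit of the original argument, as tracked in r12/r13; n = the
+# region count npi2 for that lane):
+#
+#   uint64_t sin_neg = (A ^ (n >> 1)) & 1;   /* sin flips with A and bit 1 of n */
+#   uint64_t cos_neg = ((n + 1) >> 1) & 1;   /* cos negative for n mod 4 = 1, 2 */
+#
+# The lower lane's bit is shifted to bit 63 of the stored quadword and the
+# upper lane's bit into the sign position of the upper double, hence the
+# shifts by 63 and 31 above.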
+
+ jmp .Lsinsin_sinsin_piby4
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ xorpd p_signs(%rsp),%xmm4 # (+) Sign
+ xorpd p_signs1(%rsp),%xmm5 # (+) Sign
+
+ xorpd p_signc(%rsp),%xmm12 # (+) Sign
+ xorpd p_signc1(%rsp),%xmm13 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ysa(%rsp),%rdi # get ysin_array pointer
+ mov save_yca(%rsp),%rbx # get ycos_array pointer
+
+ movlpd %xmm4,(%rdi)
+ movhpd %xmm4,8(%rdi)
+
+ movlpd %xmm12,(%rbx)
+ movhpd %xmm12,8(%rbx)
+
+.L__vrda_bottom2:
+
+ prefetch 64(%rdi)
+ prefetch 64(%rbx)
+
+ add $32,%rdi
+ add $32,%rbx
+
+ mov %rdi,save_ysa(%rsp) # save ysin_array pointer
+ mov %rbx,save_yca(%rsp) # save ycos_array pointer
+
+# store the result _m128d
+ movlpd %xmm5, -16(%rdi)
+ movhpd %xmm5, -8(%rdi)
+
+ movlpd %xmm13, -16(%rbx)
+ movhpd %xmm13, -8(%rbx)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+ mov save_rbx(%rsp),%rbx # restore rbx
+
+ add $0x0308,%rsp
+ ret
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when there are leftover values (fewer than four) to compute at the end.
+# save_xa points at the remaining x elements and save_ysa/save_yca at the
+# corresponding output slots; the number of values left is in save_nv.
+
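+# A rough C equivalent of the tail handling below (sketch only; the argument
+# order n, x, ys, yc matches the register setup used for the recursive call):
+#
+#   double tmp_x[4] = { 0.0, 0.0, 0.0, 0.0 }, tmp_ys[4], tmp_yc[4];
+#   for (int i = 0; i < nleft; i++)          /* nleft is 1, 2 or 3 here      */
+#       tmp_x[i] = x[i];
+#   vrda_sincos(4, tmp_x, tmp_ys, tmp_yc);   /* compute a full group of four */
+#   for (int i = 0; i < nleft; i++) {        /* copy back only the live ones */
+#       ys[i] = tmp_ys[i];
+#       yc[i] = tmp_yc[i];
+#   }
+#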
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &ys parameter
+ lea p_temp4(%rsp),%rcx # &yc parameter
+
+ call vrda_sincos@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%rbx
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ mov p_temp4(%rsp),%rdx
+ mov %rdx,(%rbx) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ mov p_temp4+8(%rsp),%rdx
+ mov %rdx,8(%rbx) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+ mov p_temp4+16(%rsp),%rdx
+ mov %rdx,16(%rbx) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
diff --git a/src/gas/vrs4cosf.S b/src/gas/vrs4cosf.S
new file mode 100644
index 0000000..ab59058
--- /dev/null
+++ b/src/gas/vrs4cosf.S
@@ -0,0 +1,2122 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4_cosf.s
+#
+# A vector implementation of the cosf libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_cosf(__m128 x);
+#
+# Computes the cosine of x for four single-precision values at a time.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# The four input values are passed as packed singles in xmm0 and the
+# four results are returned as packed singles in xmm0.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 4 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory and retrieving
+# the results from memory. This routine eliminates that overhead when the
+# data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
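+# A usage sketch from C (illustrative assumption, not part of this file: the
+# cosf4 wrapper below is hypothetical, only the __vrs4_cosf prototype is
+# provided by this source):
+#
+#   #include <xmmintrin.h>
+#
+#   __m128 __vrs4_cosf(__m128 x);
+#
+#   static void cosf4(const float *in, float *out)
+#   {
+#       __m128 v = _mm_loadu_ps(in);    /* load four packed singles  */
+#       v = __vrs4_cosf(v);             /* four cosines in one call  */
+#       _mm_storeu_ps(out, v);          /* store the four results    */
+#   }
+#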
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+
+.align 64
+.Levencos_oddsin_tbl:
+
+ .quad .Lcoscos_coscos_piby4 # 0 * ; Done
+ .quad .Lcoscos_cossin_piby4 # 1 + ; Done
+ .quad .Lcoscos_sincos_piby4 # 2 ; Done
+ .quad .Lcoscos_sinsin_piby4 # 3 + ; Done
+
+ .quad .Lcossin_coscos_piby4 # 4 ; Done
+ .quad .Lcossin_cossin_piby4 # 5 * ; Done
+ .quad .Lcossin_sincos_piby4 # 6 ; Done
+ .quad .Lcossin_sinsin_piby4 # 7 ; Done
+
+ .quad .Lsincos_coscos_piby4 # 8 ; Done
+ .quad .Lsincos_cossin_piby4 # 9 ; TBD
+ .quad .Lsincos_sincos_piby4 # 10 * ; Done
+ .quad .Lsincos_sinsin_piby4 # 11 ; Done
+
+ .quad .Lsinsin_coscos_piby4 # 12 ; Done
+ .quad .Lsinsin_cossin_piby4 # 13 + ; Done
+ .quad .Lsinsin_sincos_piby4 # 14 ; Done
+ .quad .Lsinsin_sinsin_piby4 # 15 * ; Done
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # save area for xmm6
+.equ save_xmm7,0x30 # save area for xmm7
+.equ save_xmm8,0x40 # save area for xmm8
+.equ save_xmm9,0x50 # save area for xmm9
+.equ save_xmm0,0x60 # save area for xmm0
+.equ save_xmm11,0x70 # save area for xmm11
+.equ save_xmm12,0x80 # save area for xmm12
+.equ save_xmm13,0x90 # save area for xmm13
+.equ save_xmm14,0x0A0 # save area for xmm14
+.equ save_xmm15,0x0B0 # save area for xmm15
+
+.equ r,0x0C0 # storage for r passed to remainder_piby2
+.equ rr,0x0D0 # storage for rr passed to remainder_piby2
+.equ region,0x0E0 # storage for region passed to remainder_piby2
+
+.equ r1,0x0F0 # storage for r1 passed to remainder_piby2
+.equ rr1,0x0100 # storage for rr1 passed to remainder_piby2
+.equ region1,0x0110 # storage for region1 passed to remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign,0x0180 # sign
+
+.equ p_original1,0x0190 # original x (second pair)
+.equ p_mask1,0x01A0 # mask (second pair)
+.equ p_sign1,0x01B0 # sign (second pair)
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+
+.globl __vrs4_cosf
+ .type __vrs4_cosf,@function
+__vrs4_cosf:
+ sub $0x01E8,%rsp
+
+#DEBUG
+# mov %r12,save_r12(%rsp) # save r12
+# mov %r13,save_r13(%rsp) # save r13
+
+# mov save_r12(%rsp),%r12 # restore r12
+# mov save_r13(%rsp),%r13 # restore r13
+
+# add $0x01E8,%rsp
+# ret
+#DEBUG
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
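+# In scalar terms, the per-lane sign selection above is roughly (sketch; n is
+# the npi2 region count for one lane):
+#
+#   uint64_t neg  = (n ^ (n >> 1)) & 1;   /* set for n mod 4 = 1 or 2, where the result is negated */
+#   uint64_t mask = neg << 63;            /* IEEE-754 double sign bit                              */
+#
+# The packed code builds this for two lanes at once, which is why the upper
+# lane's bit is isolated with .L__reald_one_zero and shifted by 31 so that it
+# lands in the upper double's sign position.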
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
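+# Sketch of the dispatch index built above: bit 0 of each lane's region picks
+# sin (odd) or cos (even) of the reduced argument, and the four lane bits are
+# packed into a single 4-bit table index:
+#
+#   idx =  odd0          /* lane 0, first pair   -> bit 0 */
+#       | (odd1 << 1)    /* lane 1, first pair   -> bit 1 */
+#       | (odd2 << 2)    /* lane 0, second pair  -> bit 2 */
+#       | (odd3 << 3);   /* lane 1, second pair  -> bit 3 */
+#   /* followed by an indirect jump through .Levencos_oddsin_tbl[idx] */
+#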
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx, %rax, %r8, %r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Be sure not to use %xmm3, %xmm1 and %xmm7
+# Use %xmm8, %xmm5, %xmm0, %xmm12
+# %xmm11, %xmm9, %xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm3, %xmm1, %xmm7
+# Restore xmm4 and %xmm3, %xmm1, %xmm7
+# Can use %xmm0, %xmm8, %xmm12
+# %xmm9, %xmm5, %xmm11, %xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+
+ subsd %xmm0,%xmm6 # xmm6 = r=(rhead-rtail)
+
+ movlpd %xmm6,r(%rsp) # store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm3, %xmm1, %xmm7
+# Can use %xmm11, %xmm9, %xmm13
+# %xmm8, %xmm5, %xmm0, %xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+ subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+ subsd %xmm10,%xmm7 # xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1(%rsp) # store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x01E8,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrs4_cosf_cleanup
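+# The core approximation used above, as a sketch (c1..c4 are the .Lcosarray
+# coefficients):
+#
+#   double x2 = r * r;
+#   double x4 = x2 * x2;
+#   double t  = 1.0 - 0.5 * x2;
+#   double zc = (c1 + x2 * c2) + x4 * (c3 + x2 * c4);
+#   double cos_r = t + x4 * zc;
+#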
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrs4_cosf_cleanup
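+# The sin half above follows the matching sketch (s1..s4 are the .Lsinarray
+# coefficients):
+#
+#   double x2 = r * r;
+#   double x4 = x2 * x2;
+#   double x3 = x2 * r;
+#   double zs = (s1 + x2 * s2) + x4 * (s3 + x2 * s4);
+#   double sin_r = r + x3 * zs;
+#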
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+	addpd	.Lsinarray+0x20(%rip),%xmm5			# s3+x2s4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+	addpd	.Lsinarray(%rip),%xmm9				# s1+x2s2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+	mulpd	%xmm3,%xmm5					# x3 * zs
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lsincosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lsincosarray(%rip),%xmm8			# c1+x2c2
+	addpd	.Lcosarray(%rip),%xmm9				# c1+x2c2
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcossinarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcossinarray(%rip),%xmm8			# c1+x2c2
+	addpd	.Lcosarray(%rip),%xmm9				# c1+x2c2
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lsincosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcosarray(%rip),%xmm8				# c1+x2c2
+	addpd	.Lsincosarray(%rip),%xmm9			# c1+x2c2
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcossinarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcosarray(%rip),%xmm8				# c1+x2c2
+	addpd	.Lcossinarray(%rip),%xmm9			# c1+x2c2
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
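+	# Note (added): the sum is split as (c1 + x2*c2) + x4*(c3 + x2*c4) rather
+	# than one long Horner chain, presumably so the two halves can be computed
+	# in independent register chains (xmm8/xmm9 and xmm4/xmm5) before merging.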
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_cosf_cleanup
diff --git a/src/gas/vrs4expf.S b/src/gas/vrs4expf.S
new file mode 100644
index 0000000..b0e23aa
--- /dev/null
+++ b/src/gas/vrs4expf.S
@@ -0,0 +1,410 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# __vrs4_expf.s
+#
+# A vector implementation of the expf libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_expf(__m128 x);
+#
+# Computes e raised to the x power for 4 floats at a time.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+#
+#
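+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels defined below):
+#
+#   n = rint(x * 32/ln(2));   j = n & 0x1f;   m = (n - j) / 32;
+#   r = (x - n*log2_by_32_lead) + (- n*log2_by_32_tail);     /* r1 + r2 */
+#   q = r + r^2/2 + r^3/6 + r^4/24;
+#   expf(x) ~= 2^m * two_to_jby32_table[j] * (1 + q)
+#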
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_ux,0x10 # local storage for ux array
+.equ p_m,0x20 # local storage for m array
+.equ	p_j,0x30		# local storage for j array
+.equ save_rbx,0x040 #qword
+.equ stack_size,0x48
+
+
+
+.globl __vrs4_expf
+ .type __vrs4_expf,@function
+__vrs4_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0 # protect against small input values
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3
+
+ mov p_j(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j(%rsp) # save the f1 value
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2 # x*x
+ mulps %xmm2,%xmm2
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mulps %xmm3,%xmm4 # *x^3
+
+ mov p_j+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%eax
+ test $0xf,%eax
+
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# deal with nans and infinities
+
+.L__exp_naninf:
+ movaps %xmm0,p_temp(%rsp) # save the computed values
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .__Lni2
+ mov p_ux(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp(%rsp) # save the new result
+.__Lni2:
+	test		$2,%ecx		# second value?
+ jz .__Lni3
+ mov p_ux+4(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+4(%rsp) # save the new result
+.__Lni3:
+	test		$4,%ecx		# third value?
+ jz .__Lni4
+ mov p_ux+8(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+8(%rsp) # save the new result
+.__Lni4:
+	test		$8,%ecx		# fourth value?
+ jz .__Lnie
+ mov p_ux+12(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+12(%rsp) # save the new result
+.__Lnie:
+ movaps p_temp(%rsp),%xmm0 # get the answers
+ jmp .L__final_check
+
+
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects input in edx, and returns value in edx. Destroys eax.
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
+
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
+
+.L__exp_largef:
+ movdqa %xmm0,p_temp(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_temp(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_temp+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_temp+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_temp+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_temp(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+
+ .data
+ .align 64
+
+.L__real_half: .long 0x3f000000 # 1/2
+ .long 0x3f000000
+ .long 0x3f000000
+ .long 0x3f000000
+
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+
+.L__real_thirtytwo_by_log2: .long 0x4238AA3B # thirtytwo_by_log2
+ .long 0x4238AA3B
+ .long 0x4238AA3B
+ .long 0x4238AA3B
+
+.L__real_log2_by_32: .long 0x3CB17218 # log2_by_32
+ .long 0x3CB17218
+ .long 0x3CB17218
+ .long 0x3CB17218
+
+.L__real_log2_by_32_head: .long 0x3CB17000 # log2_by_32
+ .long 0x3CB17000
+ .long 0x3CB17000
+ .long 0x3CB17000
+
+.L__real_log2_by_32_tail: .long 0xB585FDF4 # log2_by_32
+ .long 0xB585FDF4
+ .long 0xB585FDF4
+ .long 0xB585FDF4
+
+.L__real_1_6: .long 0x3E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x3E2AAAAB
+ .long 0x3E2AAAAB
+ .long 0x3E2AAAAB
+
+.L__real_1_24: .long 0x3D2AAAAB # 0.041666668 used in polynomial
+ .long 0x3D2AAAAB
+ .long 0x3D2AAAAB
+ .long 0x3D2AAAAB
+
+.L__real_infinity: .long 0x7f800000 # infinity
+ .long 0x7f800000
+ .long 0x7f800000
+ .long 0x7f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+
+.L__two_to_jby32_table:
+ .long 0x3F800000 # 1
+ .long 0x3F82CD87 # 1.0218972
+ .long 0x3F85AAC3 # 1.0442737
+ .long 0x3F88980F # 1.0671405
+ .long 0x3F8B95C2 # 1.0905077
+ .long 0x3F8EA43A # 1.1143868
+ .long 0x3F91C3D3 # 1.1387886
+ .long 0x3F94F4F0 # 1.1637249
+ .long 0x3F9837F0 # 1.1892071
+ .long 0x3F9B8D3A # 1.2152474
+ .long 0x3F9EF532 # 1.2418578
+ .long 0x3FA27043 # 1.269051
+ .long 0x3FA5FED7 # 1.2968396
+ .long 0x3FA9A15B # 1.3252367
+ .long 0x3FAD583F # 1.3542556
+ .long 0x3FB123F6 # 1.3839099
+ .long 0x3FB504F3 # 1.4142135
+ .long 0x3FB8FBAF # 1.4451808
+ .long 0x3FBD08A4 # 1.4768262
+ .long 0x3FC12C4D # 1.5091645
+ .long 0x3FC5672A # 1.5422108
+ .long 0x3FC9B9BE # 1.5759809
+ .long 0x3FCE248C # 1.6104903
+ .long 0x3FD2A81E # 1.6457555
+ .long 0x3FD744FD # 1.6817929
+ .long 0x3FDBFBB8 # 1.7186193
+ .long 0x3FE0CCDF # 1.7562522
+ .long 0x3FE5B907 # 1.7947091
+ .long 0x3FEAC0C7 # 1.8340081
+ .long 0x3FEFE4BA # 1.8741677
+ .long 0x3FF5257D # 1.9152066
+ .long 0x3FFA83B3 # 1.9571441
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4log10f.S b/src/gas/vrs4log10f.S
new file mode 100644
index 0000000..d6d9ac8
--- /dev/null
+++ b/src/gas/vrs4log10f.S
@@ -0,0 +1,646 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log10f(__m128 x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
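+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels and tables defined below):
+#
+#   x  = 2^xexp * f;  f1 = index/128 from the lookup tables, f2 = f - f1;
+#   r1 = xexp*log2_lead + ln(f1)_lead
+#   r2 = xexp*log2_tail + ln(f1)_tail + poly(u),  u ~= f2/f1
+#   log10f(x) ~= r1*log10e_lead
+#              + (r1*log10e_tail + r2*log10e_tail + r2*log10e_lead)
+#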
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log10f
+ .type __vrs4_log10f,@function
+__vrs4_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f1 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# logef to log10f
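+# (r1,r2) hold ln(x) as a lead/tail pair; multiply by log10(e), itself split
+# into lead/tail constants, and accumulate the small cross products first so
+# the final addition of r1*log10e_lead loses as little precision as possible.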
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# is 0 < x?  false for zero, negatives, and NaNs
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
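+# (Added note: this evaluates ln(1+r) = 2*atanh(r/(2+r)).  With u = 2r/(2+r),
+#  r - u equals r*(r/(2+r)) = correction, so the final "r + r2" below
+#  reproduces u + u*v*(ca_1 + v*(ca_2 + ...)).)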
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4log2f.S b/src/gas/vrs4log2f.S
new file mode 100644
index 0000000..05185b2
--- /dev/null
+++ b/src/gas/vrs4log2f.S
@@ -0,0 +1,639 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log2f(__m128 x);
+#
+# Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
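+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels defined below):
+#
+#   x = 2^xexp * f;  z1/z2 hold ln(f) as a lead/tail pair (same tables and
+#   reduction as the other log variants);
+#   log2f(x) ~= (z1*log2e_lead + xexp)                               /* r1 */
+#             + (z1*log2e_tail + z2*log2e_tail + z2*log2e_lead)      /* r2 */
+#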
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log2f
+ .type __vrs4_log2f,@function
+__vrs4_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check 2 as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f1 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# is 0 < x?  false for zero, negatives, and NaNs
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4logf.S b/src/gas/vrs4logf.S
new file mode 100644
index 0000000..4a39f1c
--- /dev/null
+++ b/src/gas/vrs4logf.S
@@ -0,0 +1,614 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4logf.s
+#
+# A vector implementation of the logf libm function.
+#   This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_logf(__m128 x);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
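+# A minimal usage sketch (wrapper name and unaligned load/store are
+# illustrative), assuming the routine is linked in and called with the
+# prototype above:
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_logf(__m128 x);
+#
+#   /* natural log of four packed floats */
+#   void log4(const float in[4], float out[4]) {
+#       _mm_storeu_ps(out, __vrs4_logf(_mm_loadu_ps(in)));
+#   }
+#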
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ	p_z1,0x020		# xmmword, z1 (table lead) values
+.equ	p_q,0x030		# xmmword, q (table tail) values
+.equ	p_corr,0x040		# xmmword, correction term
+.equ	p_omask,0x050		# xmmword, near-one mask
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_logf
+ .type __vrs4_logf,@function
+__vrs4_logf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
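+# Rough per-lane sketch of the reduction used below: writing x = 2^n * m with
+# 1 <= m < 2 and m1 the nearest of the 64 tabulated points,
+#
+#   u     = 2*(m - m1)/(m + m1)
+#   ln(x) = n*ln(2) + ln(m1) + ln(m/m1)
+#        ~= n*ln(2) + ln(m1) + (u + u^3/12 + u^5/80 + u^7/448)
+#
+# where n*ln(2) and ln(m1) are each split into lead/tail parts (the tables at
+# the end of this file) so the final sum keeps more than single precision.
+#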
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+	mov		-256(%rdx,%r8,4),%eax		# get the q value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+	or		-256(%rdx,%r8,4),%ebx		# get the q value
+ shl $32,%rbx
+ or %rbx,%rax
+	mov		 %rax,p_q(%rsp)			# save the q values
+
+ mov %cx,%r8w
+ shr $16,%rcx
+	mov		-256(%rdx,%r8,4),%eax		# get the q value
+
+ mov %cx,%r8w
+	mov		-256(%rdx,%r8,4),%ebx		# get the q value
+ shl $32,%rbx
+ or %rbx,%rax
+	mov		 %rax,p_q+8(%rsp)		# save the q values
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x ?  false for NaNs, so they are caught here too
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
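+# Consolidated per-lane sketch of the near-one path above (ca_1..ca_4 are the
+# .L__real_ca* coefficients in the data section):
+#
+#   float r = x - 1.0f;
+#   float u = r / (2.0f + r);
+#   float correction = r * u;
+#   u = u + u;
+#   float v = u * u;
+#   float r2 = u*v*(ca_1 + v*(ca_2 + v*(ca_3 + v*ca_4))) - correction;
+#   return r + r2;
+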
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:		.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:		.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:		.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4powf.S b/src/gas/vrs4powf.S
new file mode 100644
index 0000000..42b005d
--- /dev/null
+++ b/src/gas/vrs4powf.S
@@ -0,0 +1,623 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4powf.s
+#
+# A vector implementation of the powf libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_powf(__m128 x,__m128 y);
+#
+# Computes x raised to the y power. Returns proper C99 values.
+# Uses new tuned fastlog/fastexp.
+#
+#
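+# Note on accuracy: the tuned double precision log/exp pair keeps the
+# intermediate y*log(x) well beyond single precision, which is what allows a
+# near-1ulp single precision result; see the sketch of the main path further
+# down in this file.
+#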
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ p_xexp,0x20 # qword
+
+.equ p_ux,0x030 # storage for X
+.equ p_uy,0x040 # storage for Y
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicators
+.equ save_rbx,0x0A0 #
+
+.equ stack_size,0x0B8 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs4_powf
+ .type __vrs4_powf,@function
+__vrs4_powf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ movaps %xmm0,p_ux(%rsp) # save x
+ movaps %xmm1,p_uy(%rsp) # save y
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
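+#
+# One way to express this classification per lane in C (uy = bits of y,
+# ay = uy & 0x7fffffff; the packed-integer code below reaches the same result):
+#
+#   int yexp = (int)(ay >> 23) - 126;            /* unbiased exponent + 1    */
+#   int inty;
+#   if (yexp < 1)        inty = 0;               /* |y| < 1: not an integer  */
+#   else if (yexp > 24)  inty = 2;               /* no fractional bits left  */
+#   else {
+#       unsigned mask = (1u << (24 - yexp)) - 1u;
+#       if (uy & mask)                    inty = 0;   /* fractional bits set */
+#       else if ((uy >> (24 - yexp)) & 1) inty = 1;   /* lowest integer bit  */
+#       else                              inty = 2;
+#   }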
+ movdqa p_uy(%rsp),%xmm4
+ pxor %xmm3,%xmm3
+ pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format
+ movdqa %xmm4,p_ay(%rsp) # save it
+
+# see if the number is less than 1.0
+ psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32
+
+ psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent
+ movdqa %xmm4,p_yexp(%rsp) # save it
+ paddd .L__mask_1(%rip),%xmm4 # yexp+1
+ pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs
+# xmm4 is ffs if abs(y) >=1.0, else 0
+
+# see if the mantissa has fractional bits
+#build mask for mantissa
+ movdqa .L__mask_23(%rip),%xmm2
+ psubd p_yexp(%rsp),%xmm2 # 24-yexp
+ pmaxsw %xmm3,%xmm2 # no shift counts less than 0
+ movdqa %xmm2,p_temp(%rsp) # save the shift counts
+# create mask for all four values
+# SSE can't do individual shifts, so we have to do each one separately
+ mov p_temp(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rax,%rbx
+ mov %rbx,p_temp(%rsp)
+ mov p_temp+8(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rbx,%rax
+ mov %rax,p_temp+8(%rsp)
+ movdqa p_temp(%rsp),%xmm5
+ psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1
+
+# now use the mask to see if there are any fractional bits
+ movdqa p_uy(%rsp),%xmm2 # get uy
+ pand %xmm5,%xmm2 # uy & mask
+ pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs
+ pand %xmm4,%xmm2 # either 0s or ff
+# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits,
+# it has the value 0 if we know it's non-integer or ff if integer.
+
+# now see if it's even or odd.
+
+## if yexp > 24, then it has to be even
+ movdqa .L__mask_24(%rip),%xmm4
+ psubd p_yexp(%rsp),%xmm4 # 24-yexp
+ paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit
+ pcmpgtd %xmm3,%xmm4 ## if 0, then must be even, else ff's
+
+ pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24
+ paddd .L__mask_2(%rip),%xmm4
+ por .L__mask_2(%rip),%xmm4
+ pand %xmm2,%xmm4 # result can be 0, 2, or 3
+
+# now for integer numbers, see if odd or even
+ pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits
+ movdqa .L__float_one(%rip),%xmm2
+ pand p_uy(%rsp),%xmm5 # & uy -> even or odd
+ pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd
+ pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works.
+ por %xmm2,%xmm5
+ pcmpgtd %xmm3,%xmm5 ## if odd then ff's, else 0's for even
+ paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd
+ pand %xmm5,%xmm4
+
+ movdqa %xmm4,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ movdqa %xmm4,%xmm5
+ pcmpeqd %xmm3,%xmm5 # is not an integer? ff's if so
+ pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0
+ movdqa %xmm4,%xmm2
+ pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so
+ pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set
+ por %xmm2,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# ** 7FC00000 means x<0, y not an integer, return NaN.
+# ** 80000000 means x<0, y is odd integer, so set the sign bit.
+# ** 0 means even integer, and/or x>=0.
+# */
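+#
+# Per-lane sketch of how this encoding is applied (it is OR'ed into the
+# computed result later with orps; the names below are illustrative):
+#
+#   unsigned negateres =
+#         (x < 0.0f && inty == 0) ? 0x7FC00000u   /* quiet NaN           */
+#       : (x < 0.0f && inty == 1) ? 0x80000000u   /* set the sign bit    */
+#       :                           0x00000000u;  /* leave result alone  */
+#   result_bits |= negateres;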
+
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
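+# Per-lane sketch of the main path (double precision intermediates carry the
+# extra bits; __vrd4_log and __vrd4_exp are the vector double routines called
+# below):
+#
+#   double w = y * log((double)fabsf(x));   /* y * log(|x|)                 */
+#   float  r = (float)exp(w);               /* |x|**y                       */
+#   /* zeros, infinities, NaNs and negative x are patched in afterwards     */
+#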
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+# convert all four y's to double
+ lea p_uy(%rsp),%rdx # get pointer to y
+ cvtps2pd (%rdx),%xmm2
+ cvtps2pd 8(%rdx),%xmm3
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm3,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+#
+# convert all four results back to single precision
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+ lea p_uy(%rsp),%rdx # get pointer to y
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ movdqa p_ay(%rsp),%xmm4
+ cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ lea p_uy(%rsp),%rdx # get pointer to y
+ movdqa (%rdx),%xmm4 # get y
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should
+ # be false, unless y is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ lea p_ux(%rsp),%rdx # get pointer to x
+ movdqa (%rdx),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm3 # one
+ xorps %xmm2,%xmm2
+ cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+## if x == +1, return +1 for all x
+ lea p_ux(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# y is a NaN.
+.Ly_NaN:
+ lea p_uy(%rsp),%rdx # get pointer to y
+ movdqa (%rdx),%xmm4 # get y
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of y to itself should
+ # be true, unless y is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ lea p_ux(%rsp),%rcx # get pointer to x
+ movdqa (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# * y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lylrga
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lylrga:
+ test $2,%edx
+ jz .Lylrgb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lylrgb:
+ test $4,%edx
+ jz .Lylrgc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lylrgc:
+ test $8,%edx
+ jz .Lylrgd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lylrgd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
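+#
+# Per-lane sketch of this handler (helper name is hypothetical; |y| is huge
+# or infinite, so only the magnitude of x matters):
+#
+#   float pow_y_huge(float x, float y) {
+#       if (fabsf(x) == 1.0f) return 1.0f;
+#       if (y > 0.0f) return fabsf(x) > 1.0f ? INFINITY : 0.0f;
+#       else          return fabsf(x) < 1.0f ? INFINITY : 0.0f;
+#   }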
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+	cmovg	%ecx,%eax			# return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
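+#
+# Per-lane sketch of the C99 cases handled below:
+#
+#   if (x == INFINITY)         return y > 0.0f ? INFINITY : 0.0f;
+#   else /* x == -INFINITY */ {
+#       if (inty == 1)         return y > 0.0f ? -INFINITY : -0.0f;  /* odd y */
+#       else                   return y > 0.0f ?  INFINITY :  0.0f;
+#   }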
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx ## if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # return -x (|x|) if y<0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 #
+ xor %eax,%eax # return 0 if y >=0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
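+#
+# Per-lane sketch of the C99 cases handled below (x is +/-0):
+#
+#   if (inty == 1)   /* odd integer y: the result keeps x's sign */
+#       return copysignf(y < 0.0f ? INFINITY : 0.0f, x);
+#   else             /* even or non-integer y */
+#       return y < 0.0f ? INFINITY : 0.0f;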
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx ## if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf:		.quad 0x07f8000007F800000	# infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
diff --git a/src/gas/vrs4powxf.S b/src/gas/vrs4powxf.S
new file mode 100644
index 0000000..e18b5db
--- /dev/null
+++ b/src/gas/vrs4powxf.S
@@ -0,0 +1,538 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4powxf.asm
+#
+# A vector implementation of the powf libm function.
+# This routine raises the x vector to a constant y power.
+#
+# Prototype:
+#
+# __m128 __vrs4_powxf(__m128 x,float y);
+#
+# Computes x raised to the y power. Returns proper C99 values.
+#
+#
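+# A minimal usage sketch (wrapper name is illustrative), assuming the routine
+# is linked in and called with the prototype above:
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_powxf(__m128 x, float y);
+#
+#   /* raise four packed floats to the same constant power y */
+#   void pow4x(const float x[4], float y, float out[4]) {
+#       _mm_storeu_ps(out, __vrs4_powxf(_mm_loadu_ps(x), y));
+#   }
+#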
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ save_rbx,0x020 #qword
+.equ save_rsi,0x028 #qword
+
+.equ p_xptr,0x030 # ptr to x values
+.equ p_y,0x038 # y value
+
+.equ p_inty,0x040 # integer y indicators
+
+.equ	p_ux,0x050		# storage for x
+.equ p_ax,0x060 # absolute x
+.equ p_sx,0x070 # sign of x's
+
+.equ stack_size,0x088 #
+
+
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs4_powxf
+ .type __vrs4_powxf,@function
+__vrs4_powxf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ lea p_ux(%rsp),%rcx
+ mov %rcx,p_xptr(%rsp) # save pointer to x
+ movaps %xmm0,(%rcx)
+ movss %xmm1,p_y(%rsp) # save y
+
+ movdqa %xmm1,%xmm4
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# */
+# get yexp
+ mov p_y(%rsp),%r8d # r8 is uy
+ mov $0x07fffffff,%r9d
+ and %r8d,%r9d # r9 is ay
+
+## if |y| == 0 then return 1
+ cmp $0,%r9d # is y a zero?
+ jz .Ly_zero
+
+ mov $0x07f800000,%eax # EXPBITS_SP32
+ and %r9d,%eax # y exp
+
+ xor %edi,%edi
+ shr $23,%eax #>> EXPSHIFTBITS_SP32
+ sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent
+ mov $1,%ebx
+ cmp %ebx,%eax ## if (yexp < 1)
+ cmovl %edi,%ebx
+ jl .Lsave_inty
+
+ mov $24,%ecx
+ cmp %ecx,%eax ## if (yexp >24)
+ jle .Linfy1
+ mov $2,%ebx
+ jmp .Lsave_inty
+.Linfy1: # else 1<=yexp<=24
+ sub %eax,%ecx # build mask for mantissa
+ shl %cl,%ebx
+ dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1
+
+ mov %r8d,%eax
+ and %ebx,%eax ## if ((uy & mask) != 0)
+ cmovnz %edi,%ebx # inty = 0;
+ jnz .Lsave_inty
+
+ not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001)
+ mov %r8d,%eax
+ and %ebx,%eax
+ shr %cl,%eax
+ inc %edi
+ and %edi,%eax
+ mov %edi,%ebx # inty = 1
+ jnz .Lsave_inty
+ inc %ebx # else inty = 2
+
+
+.Lsave_inty:
+	mov	%r8d,p_y+4(%rsp)	# duplicate y (r8d is uy)
+ mov %ebx,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ pxor %xmm3,%xmm3
+ xor %eax,%eax
+ mov $0x07FC00000,%ecx
+ cmp $0,%ebx # is y not an integer?
+ cmovz %ecx,%eax # then set to return a NaN. else 0.
+ mov $0x080000000,%ecx
+ cmp $1,%ebx # is y an odd integer?
+ cmovz %ecx,%eax # maybe set sign bit if so
+ movd %eax,%xmm5
+ pshufd $0,%xmm5,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+ cvtps2pd p_y(%rsp),%xmm2 #convert the two packed single y's to double
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm2,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+#
+# convert all four results back to single precision
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ mov p_y(%rsp),%edx # get y
+ and $0x07fffffff,%edx # develop ay
+ cmp $0x04f000000,%edx
+ ja .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movss p_y(%rsp),%xmm4 # get y
+ ucomiss %xmm4,%xmm4 # comparing y to itself should
+ # be true, unless y is a NaN. parity flag if NaN.
+ jp .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqa p_ax(%rsp),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if x == +1, return +1 for all x
+ movdqa .L__float_one(%rip),%xmm3 # one
+ mov p_xptr(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Ly_zero:
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm0 # one
+ jmp .L__powf_cleanup2
+# * y is a NaN.
+.Ly_NaN:
+ mov p_y(%rsp),%r8d
+ or $0x000400000,%r8d # convert to QNaNs
+ movd %r8d,%xmm0 # propagate to all results
+ shufps $0,%xmm0,%xmm0
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqa (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# * y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 4(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 8(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 12(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+	cmovg	%ecx,%eax			# return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx ## if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # return |x| (= +inf) if y > 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # return +inf if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
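+# A C-style reading of .Lnp_special_x1 above (sketch only; inty == 1 is taken
+# to mean "y is an odd integer"):
+#
+#   float np_special_x1(float x, float y, int inty)   /* x is +/-infinity */
+#   {
+#       if (x > 0.0f)  return (y > 0.0f) ? x : 0.0f;   /* x = +inf         */
+#       if (inty == 1) return (y > 0.0f) ? x : -0.0f;  /* x = -inf, odd y  */
+#       return (y > 0.0f) ? INFINITY : 0.0f;           /* x = -inf, else   */
+#   }
+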
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov (%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx ## if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
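+# Equivalent scalar logic for .Lnp_special_x2 above, as a sketch (again,
+# inty == 1 is taken to mean "y is an odd integer"):
+#
+#   float np_special_x2(float x, float y, int inty)    /* x is +/-0 */
+#   {
+#       if (inty != 1) return (y > 0.0f) ? 0.0f : INFINITY;
+#       /* odd integer y: the result carries the sign of x */
+#       return copysignf((y > 0.0f) ? 0.0f : INFINITY, x);
+#   }
+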
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+
diff --git a/src/gas/vrs4sincosf.S b/src/gas/vrs4sincosf.S
new file mode 100644
index 0000000..2c3a0cc
--- /dev/null
+++ b/src/gas/vrs4sincosf.S
@@ -0,0 +1,1813 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sincosf.asm
+#
+# A vector implementation of the sincos libm function.
+#
+# Prototype:
+#
+# __vrs4_sincosf(__m128 x, __m128 * ys, __m128 * yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine Cosine values at a time.
+# The four values are passed as packed single in xmm0.
+# The four Sine results are returned as packed singles in the supplied ys array.
+# The four Cosine results are returned as packed singles in the supplied yc array.
+# Note that this represents a non-standard ABI usage, as no ABI (and indeed C)
+# currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
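+# A hypothetical C caller (sketch only; the return type is assumed to be void,
+# matching the prototype comment above):
+#
+#   #include <xmmintrin.h>
+#   extern void __vrs4_sincosf(__m128 x, __m128 *ys, __m128 *yc);
+#
+#   void sincos4(const float *in, float *s, float *c)
+#   {
+#       __m128 ys, yc;
+#       __vrs4_sincosf(_mm_loadu_ps(in), &ys, &yc);
+#       _mm_storeu_ps(s, ys);
+#       _mm_storeu_ps(c, yc);
+#   }
+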
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 64
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
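+# The 4-bit index used with this table (built later from the per-element
+# regions) is, in effect:
+#
+#   index = (region0 & 1) | ((region1 & 1) << 1)
+#         | ((region2 & 1) << 2) | ((region3 & 1) << 3);
+#
+# A set bit means that element fell in an odd region (npi2 odd), where the
+# computed sine and cosine polynomial results must be exchanged; each table
+# entry performs exactly the swaps its bit pattern requires.
+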
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign_sin,0x0180 # Sign of lower sin term
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # mask
+.equ p_sign1_sin,0x01B0 # Sign of upper sin term
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+.equ p_sin,0x01E0 # sin
+.equ p_cos,0x01F0 # cos
+
+.equ save_rdi,0x0200 # save area for rdi
+.equ save_rsi,0x0210 # save area for rsi
+
+.equ p_sign_cos,0x0220 # Sign of lower cos term
+.equ p_sign1_cos,0x0230 # Sign of upper cos term
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl __vrs4_sincosf
+ .type __vrs4_sincosf,@function
+__vrs4_sincosf:
+
+ sub $0x0248,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+#DELETE
+# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path
+#DELETE
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
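+# Per element, the sequence above is the usual Cody-Waite style reduction with
+# a three-part split of pi/2 (a scalar C sketch of the same steps):
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;       /* piby2_1 = leading bits of pi/2 */
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   /* later: r = rhead - rtail, with |r| <= pi/4 and region = npi2 */
+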
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi # npi2 in int
+ mov %r11,%rsi # npi2 in int
+ #ADDED
+
+ shr $1,%r10 # 0 and 1 => 0
+ shr $1,%r11 # 2 and 3 => 1
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi # xor last 2 bits of region for cos
+ xor %r11,%rsi # xor last 2 bits of region for cos
+ #ADDED
+
+ not %r12 #~(sign)
+ not %r13 #~(sign)
+ and %r12,%r10 #region & ~(sign)
+ and %r13,%r11 #region & ~(sign)
+
+ not %rax #~(region)
+ not %rcx #~(region)
+ not %r12 #~~(sign)
+ not %r13 #~~(sign)
+ and %r12,%rax #~region & ~~(sign)
+ and %r13,%rcx #~region & ~~(sign)
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi # sign for cos
+ and .L__reald_one_one(%rip),%rsi # sign for cos
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 # sign for sin
+ and .L__reald_one_one(%rip),%r11 # sign for sin
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+
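+# The bit manipulation above reduces to the following per-element rule
+# (a sketch; "sign" is 1 when the original x was negative and "region" is npi2):
+#
+#   sign_sin = ((region >> 1) ^ sign)   & 1;   /* sin is an odd function  */
+#   sign_cos = ((region >> 1) ^ region) & 1;   /* cos is an even function */
+#
+# Each resulting 1 is expanded to a 0x8000000000000000 mask over the
+# corresponding packed double and xor'ed into the final sin/cos value
+# during the cleanup phase.
+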
+# NEW
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+# subpd %xmm10,%xmm6 ;rr=rhead-r
+# subpd %xmm1,%xmm7 ;rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail
+# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail
+
+ and .L__reald_zero_one(%rip),%rax # region for jump table
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
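+# What the interleaved arithmetic above computes, per double element (sketch):
+#
+#   zc  = (c1 + x2*c2) + x4*(c3 + x2*c4);
+#   zs  = (s1 + x2*s2) + x4*(s3 + x2*s4);
+#   cos = 1.0 - 0.5*x2 + x4*zc;        /* ends up in xmm4 / xmm5 */
+#   sin = r + x3*zs;                   /* ends up in xmm6 / xmm7 */
+#
+# where r is the reduced argument, x2 = r*r, x3 = x2*r and x4 = x2*x2.
+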
+# HARSHA ADDED
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
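+# The out-of-line helper below is called with the SysV integer argument
+# registers; from the register setup its signature appears to be roughly
+# (an inference from this code, not a documented interface):
+#
+#   void __remainder_piby2d2f(uint64_t x_bits, double *r, int *region);
+#
+# i.e. the raw 64-bit pattern of the double argument in %rdi, and pointers to
+# the reduced argument and the region in %rsi and %rdx respectively.
+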
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+
+# movsd %xmm6,%xmm10
+# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm10,%xmm6 ; rr=rhead-r
+# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail)
+
+ subsd %xmm0,%xmm6 # xmm6 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r[rsp], xmm10 ; store upper r
+# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr
+
+ movlpd %xmm6,r(%rsp) # store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+
+ subsd %xmm10,%xmm7 # xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+ subsd %xmm10,%xmm7 # xmm7 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+ movlpd %xmm7,r1(%rsp) # store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi
+ mov %r11,%rsi
+ #ADDED
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi
+ xor %r11,%rsi
+ #ADDED
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+#NEW
+
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_cleanup:
+
+ mov p_sin(%rsp),%rdi
+ mov p_cos(%rsp),%rsi
+
+ movapd p_sign_cos(%rsp),%xmm10
+ movapd p_sign1_cos(%rsp),%xmm1
+
+
+ xorpd %xmm4,%xmm10 # Cos term (+) Sign
+ xorpd %xmm5,%xmm1 # Cos term (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+
+ movapd p_sign_sin(%rsp),%xmm14
+ movapd p_sign1_sin(%rsp),%xmm15
+
+ xorpd %xmm6,%xmm14 # Sin term (+) Sign
+ xorpd %xmm7,%xmm15 # Sin term (+) Sign
+
+ cvtpd2ps %xmm14,%xmm12
+ cvtpd2ps %xmm15,%xmm13
+
+ movlps %xmm0,(%rsi) # save the cos
+ movlps %xmm12,(%rdi) # save the sin
+ movlps %xmm11,8(%rsi) # save the cos
+ movlps %xmm13,8(%rdi) # save the sin
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0248,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Lcoscos_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower and Upper Even
+
+ movapd %xmm4,%xmm8
+ movapd %xmm5,%xmm9
+
+ movapd %xmm6,%xmm4
+ movapd %xmm7,%xmm5
+
+ movapd %xmm8,%xmm6
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_cossin_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_sinsin_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower even, Upper odd, Swap upper
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower odd, Upper even, Swap lower
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrs4_sincosf_cleanup
+
+
+.align 16
+.Lsincos_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm5
+ movsd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm4
+ movsd %xmm8,%xmm6
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper odd, So Swap
+
+ jmp .L__vrs4_sincosf_cleanup
diff --git a/src/gas/vrs4sinf.S b/src/gas/vrs4sinf.S
new file mode 100644
index 0000000..3744f33
--- /dev/null
+++ b/src/gas/vrs4sinf.S
@@ -0,0 +1,2171 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sinf.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_sinf(__m128 x);
+#
+# Computes Sine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI (and indeed C)
+# currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
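+# A hypothetical C caller (sketch only; this assumes the standard SysV ABI
+# matches the prototype above, with x passed and the result returned in xmm0):
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_sinf(__m128 x);
+#
+#   void sin4(const float *in, float *out)
+#   {
+#       _mm_storeu_ps(out, __vrs4_sinf(_mm_loadu_ps(in)));
+#   }
+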
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
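+
+# The tables above hold the sin and cos polynomial coefficients for the
+# reduced argument range [-pi/4, pi/4]; .Lsincosarray and .Lcossinarray
+# interleave them so one packed operation can serve a sin lane and a cos
+# lane at once. A minimal C sketch of the evaluation the kernels below
+# perform (coefficients rounded as in the comments; the exact values are
+# the hex constants above):
+#
+#   static double sin_piby4(double r)
+#   {
+#       const double s1 = -0.166667, s2 = 0.00833333,
+#                    s3 = -0.000198413, s4 = 2.75573e-6;
+#       double r2 = r * r, r4 = r2 * r2;
+#       double zs = (s1 + r2 * s2) + r4 * (s3 + r2 * s4);
+#       return r + (r2 * r) * zs;                /* x + x3*zs */
+#   }
+#
+#   static double cos_piby4(double r)
+#   {
+#       const double c1 = 0.0416667, c2 = -0.00138889,
+#                    c3 = 2.48016e-5, c4 = -2.75573e-7;
+#       double r2 = r * r, r4 = r2 * r2;
+#       double t  = 1.0 - 0.5 * r2;
+#       double zc = (c1 + r2 * c2) + r4 * (c3 + r2 * c4);
+#       return t + r4 * zc;                      /* t + x4*zc */
+#   }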
+
+.align 64
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20				# stack slot reserved to save xmm6
+.equ save_xmm7,0x30				# stack slot reserved to save xmm7
+.equ save_xmm8,0x40				# stack slot reserved to save xmm8
+.equ save_xmm9,0x50				# stack slot reserved to save xmm9
+.equ save_xmm0,0x60				# stack slot in the xmm save area
+.equ save_xmm11,0x70				# stack slot reserved to save xmm11
+.equ save_xmm12,0x80				# stack slot reserved to save xmm12
+.equ save_xmm13,0x90				# stack slot reserved to save xmm13
+.equ save_xmm14,0x0A0				# stack slot reserved to save xmm14
+.equ save_xmm15,0x0B0				# stack slot reserved to save xmm15
+
+.equ r,0x0C0					# storage for r (first pair) passed to remainder_piby2
+.equ rr,0x0D0					# storage for rr (first pair)
+.equ region,0x0E0				# storage for region (first pair)
+
+.equ r1,0x0F0					# storage for r (second pair) passed to remainder_piby2
+.equ rr1,0x0100					# storage for rr (second pair)
+.equ region1,0x0110				# storage for region (second pair)
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160				# original x (first pair)
+.equ p_mask,0x0170				# mask (first pair)
+.equ p_sign,0x0180				# sign bits (first pair)
+
+.equ p_original1,0x0190			# original x (second pair)
+.equ p_mask1,0x01A0				# mask (second pair)
+.equ p_sign1,0x01B0				# sign bits (second pair)
+
+.equ save_r12,0x01C0				# stack slot to save/restore r12
+.equ save_r13,0x01D0				# stack slot to save/restore r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl __vrs4_sinf
+ .type __vrs4_sinf,@function
+__vrs4_sinf:
+
+ sub $0x01E8,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
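+
+# In C terms, the packed code above performs a Cody-Waite style reduction
+# with pi/2 split across three constants (piby2_1 + piby2_2 + piby2_2tail),
+# so that the leading products stay (nearly) exact for the modest npi2
+# values handled on this path. A sketch of the math only:
+#
+#   int    npi2  = (int)(x * twobypi + 0.5);    /* x >= 0 here, so this rounds */
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   double r     = rhead - rtail;               /* reduced argument */
+#   /* npi2 is kept as the "region"; only its low two bits are used later */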
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
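+
+# The GPR logic above builds, per double lane, the sign of the final result.
+# Since sin(-x) = -sin(x) and sin(x + pi) = -sin(x), the sign is the XOR of
+# the input sign (A) and bit 1 of the quadrant (B); the "~AB+A~B" comments
+# are exactly A ^ B. Conceptually, per lane (a sketch, not the exact code):
+#
+#   int      neg  = input_was_negative ^ ((npi2 >> 1) & 1);
+#   uint64_t mask = (uint64_t)neg << 63;   /* xor'ed onto the double result later */
+#
+# (the shl 63 / shl 31 pair just places that bit at the sign position of the
+# low and high double of each packed pair).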
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
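+
+# The jump-table index packs the quadrant parity of the four lanes into one
+# 4-bit value: an even quadrant needs the sin polynomial for that lane and an
+# odd quadrant the cos polynomial, which is what the sixteen
+# .Levensin_oddcos_tbl entries enumerate. Roughly (sketch):
+#
+#   int idx = (npi2_0 & 1) | ((npi2_1 & 1) << 1)
+#           | ((npi2_2 & 1) << 2) | ((npi2_3 & 1) << 3);
+#   /* dispatch to the .L<...>_piby4 kernel for that sin/cos combination */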
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9 = |x| bit patterns of the four arguments
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm0, xmm12,
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
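+
+# The huge-argument paths hand the reduction to __remainder_piby2d2f. From
+# the call sites in this file it appears to take the argument's double bit
+# pattern in rdi and to return the reduced value and the quadrant through
+# the rsi/rdx pointers; a sketch of the apparent convention (not a verified
+# declaration):
+#
+#   void __remainder_piby2d2f(unsigned long long x_bits, double *r, int *region);
+#
+# NaN/Inf arguments skip the call: OR-ing 0x0008000000000000 into the bit
+# pattern quiets a signaling NaN (and turns an infinity into a NaN, since
+# sin(Inf) is NaN), and the region is forced to 0.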
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov .LQWORD,%rax PTR p_original[rsp]
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm0, xmm12,
+#         xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r = (rhead-rtail)
+
+	movlpd	%xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13,
+#         xmm5, xmm8, xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x01E8,%rsp
+ ret
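+
+# Reconstruction/cleanup in brief: each kernel leaves the two double results
+# of a pair in xmm4/xmm5 with the sign not yet applied; the cleanup XORs in
+# the sign masks computed earlier and narrows back to four singles. Roughly,
+# in C (a sketch only):
+#
+#   /* res[4]: kernel outputs as doubles, sign_mask[4]: 0 or 1ull << 63 */
+#   for (int i = 0; i < 4; i++) {
+#       unsigned long long bits;
+#       memcpy(&bits, &res[i], 8);
+#       bits ^= sign_mask[i];
+#       memcpy(&res[i], &bits, 8);
+#       out[i] = (float)res[i];
+#   }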
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrs4_sinf_cleanup
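+
+# This kernel is the all-cos case (every lane landed in an odd quadrant).
+# Note the sign trick: rather than forming t = 1 - 0.5*x2 and adding it, the
+# code keeps -t = 0.5*x2 - 1 and subtracts it at the end. In C terms (sketch):
+#
+#   double neg_t = 0.5 * r2 - 1.0;
+#   double zc    = (c1 + r2 * c2) + r4 * (c3 + r2 * c4);
+#   double cosr  = r4 * zc - neg_t;              /* == t + r4*zc */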
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrs4_sinf_cleanup
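+
+# The mixed kernels (this one and those below) handle pairs where one double
+# lane needs sin and the other needs cos. The interleaved
+# .Lsincosarray/.Lcossinarray coefficients let the packed multiplies serve
+# both lanes at once; the movhlps/movlhps shuffles then split the lanes for
+# the steps that differ, e.g. per pair (sketch):
+#
+#   out_sin = r_s + (r_s * r2_s) * zs;           /* sin lane: x + x3*zs */
+#   out_cos = (1.0 - 0.5 * r2_c) + r4_c * zc;    /* cos lane: t + x4*zc */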
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
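+	# Illustrative C sketch (not part of the original source): the two
+	# polynomial forms in the comments above are algebraically identical;
+	# the second grouping is the one computed below, since the two inner
+	# sums can be formed independently and combined with x4.  c1..c4 stand
+	# for the packed constants in .Lsinarray.
+	#
+	#   static float sin_poly(float x, float c1, float c2,
+	#                         float c3, float c4)
+	#   {
+	#       float x2 = x * x;
+	#       float x3 = x2 * x;
+	#       float x4 = x2 * x2;
+	#       return x + x3 * ((c1 + x2 * c2) + x4 * (c3 + x2 * c4));
+	#   }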
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_sinf_cleanup
diff --git a/src/gas/vrs8expf.S b/src/gas/vrs8expf.S
new file mode 100644
index 0000000..b2eb597
--- /dev/null
+++ b/src/gas/vrs8expf.S
@@ -0,0 +1,618 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8expf.s
+#
+# A vector implementation of the expf libm function.
+#
+# Prototype:
+#
+#      __m128,__m128 __vrs8_expf(__m128 x1, __m128 x2);
+#
+# Computes e raised to the x power for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
+#
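+#
+# Illustrative C sketch of the scalar algorithm the vector code below
+# follows (an approximation for exposition only, not part of the original
+# source; two_to_jby32[] stands for .L__two_to_jby32_table at the end of
+# this file, and the real code carries the reduction constant as a
+# head/tail pair and clamps huge inputs to +-8192 first):
+#
+#   #include <math.h>
+#   extern const float two_to_jby32[32];        /* 2**(j/32), j = 0..31 */
+#
+#   float expf_sketch(float x)                  /* x finite, moderate   */
+#   {
+#       const float thirtytwo_by_log2 = 4.6166241e+01f;  /* 32/ln(2)   */
+#       const float log2_by_32        = 2.1660849e-02f;  /* ln(2)/32   */
+#       int   n = (int)nearbyintf(x * thirtytwo_by_log2);
+#       int   j = n & 0x1f;                     /* table index          */
+#       int   m = (n - j) / 32;                 /* power-of-two scale   */
+#       float r = x - n * log2_by_32;           /* reduced argument     */
+#       float q = r + r*r*(0.5f + r*(1.0f/6.0f + r*(1.0f/24.0f)));
+#       return ldexpf(two_to_jby32[j] * (1.0f + q), m);
+#   }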
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_ux,0x00 #qword
+.equ p_ux2,0x010 #qword
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p_j,0x040 # second temporary for get/put bits operation
+.equ p_m,0x050 #qword
+.equ p_j2,0x060 # second temporary for exponent multiply
+.equ p_m2,0x070 #qword
+.equ save_rbx,0x080 #qword
+
+
+.equ stack_size,0x098
+
+
+# parameters passed by gcc as:
+#  xmm0 - __m128 x1
+#  xmm1 - __m128 x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs8_expf
+ .type __vrs8_expf,@function
+__vrs8_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+# Process the array 8 values at a time.
+
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0
+ movaps %xmm1,p_ux2(%rsp)
+ maxps .L__real_m8192(%rip),%xmm1
+ movaps %xmm1,%xmm6
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 #
+
+ mulps %xmm6,%xmm5
+ minps .L__real_8192(%rip),%xmm5 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+ cvtps2dq %xmm5,%xmm8
+ cvtdq2ps %xmm8,%xmm7
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+ movaps .L__real_log2_by_32_head(%rip),%xmm5
+ mulps %xmm7,%xmm5
+ subps %xmm5,%xmm6 # r1 in xmm6,
+
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+ mulps .L__real_log2_by_32_tail(%rip),%xmm7
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ movdqa %xmm8,%xmm9
+ movdqa .L__int_mask_1f(%rip),%xmm5
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+ pand %xmm9,%xmm5
+ movdqa %xmm5,p_j2(%rsp)
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+ psubd %xmm5,%xmm9
+ psrad $5,%xmm9
+ movdqa %xmm9,p_m2(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3 # r = r1+ r2
+
+ mov p_j(%rsp),%eax # get an individual index
+
+ movaps %xmm6,%xmm8
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ addps %xmm7,%xmm8 # r = r1+ r2
+ mov %edx,p_j(%rsp) # save the f1 value
+
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
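+#	(A rough bound, added for exposition and not from the original
+#	source: after reduction |r| <= log(2)/64 ~= 0.0108, so the first
+#	dropped term is |r|^5/120 ~= 1.2e-12, far below the single-precision
+#	ulp of ~6e-8, which is why the degree-4 truncation suffices.)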
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2
+ mulps %xmm2,%xmm2 # x*x
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mov p_j+12(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+
+ mulps %xmm3,%xmm4 # *x^3
+ mov p_j2(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2(%rsp) # save the f1 value
+
+
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+ movaps %xmm8,%xmm9
+ mov p_j2+4(%rsp),%eax # get an individual index
+ movaps %xmm8,%xmm5
+ mulps %xmm5,%xmm5 # x*x
+ mulps .L__real_1_24(%rip),%xmm9 # /24
+
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+4(%rsp) # save the f1 value
+
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
+
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%ebx
+ test $0x0f,%ebx
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+.L__vsa_bottom1:
+
+ # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ addps .L__real_1_6(%rip),%xmm9 # +1/6
+
+ mulps %xmm5,%xmm8 # x^3
+ mov p_j2+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+8(%rsp) # save the f1 value
+
+ mulps .L__real_half(%rip),%xmm5 # x^2/2
+ mulps %xmm8,%xmm9 # *x^3
+
+ mov p_j2+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+12(%rsp) # save the f1 value
+
+ addps %xmm9,%xmm7 # +r2
+
+ addps %xmm5,%xmm7 # + x^2/2
+ addps %xmm7,%xmm6 # +r1
+
+
+ # deal with infinite or denormal results
+ movdqa p_m2(%rsp),%xmm7
+ movdqa p_m2(%rsp),%xmm5
+ pcmpgtd .L__int_127(%rip),%xmm5
+ pminsw .L__int_128(%rip),%xmm7 # ceil at 128
+ movmskps %xmm5,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm7 # add bias
+
+ # *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j2(%rsp),%xmm6 # * f1
+ addps p_j2(%rsp),%xmm6 # + f1
+ jnz .L__exp_largef2
+.L__check2:
+ pxor %xmm1,%xmm1 # floor at 0
+ pmaxsw %xmm1,%xmm7
+
+ pslld $23,%xmm7 # build 2^n
+
+ movaps %xmm7,%xmm1
+
+
+ # check for infinity or nan
+ movaps p_ux2(%rsp),%xmm7
+ andps .L__real_infinity(%rip),%xmm7
+ cmpps $0,.L__real_infinity(%rip),%xmm7
+ movmskps %xmm7,%ebx
+ test $0x0f,%ebx
+
+
+ # end of splitexp
+ # /* Scale (z1 + z2) by 2.0**m */
+ # Step 3. Reconstitute.
+
+ mulps %xmm6,%xmm1 # result *= 2^n
+
+ jnz .L__exp_naninf2
+
+.L__vsa_bottom2:
+
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_ux(%rsp),%rcx
+ lea p_j(%rsp),%rsi
+ call .L__fexp_naninf
+ jmp .L__vsa_bottom1
+.L__exp_naninf2:
+ lea p_ux2(%rsp),%rcx
+ lea p_j(%rsp),%rsi
+ movaps %xmm0,%xmm2
+ movaps %xmm1,%xmm0
+ call .L__fexp_naninf
+ movaps %xmm0,%xmm1
+ movaps %xmm2,%xmm0
+ jmp .L__vsa_bottom2
+
+# deal with nans and infinities
+# This subroutine checks a packed single for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# rbx - mask of errors
+# xmm0 - computed result vector
+# Outputs:
+# xmm0 - new result vector
+# %rax,rdx,rbx,%xmm2 all modified.
+
+.L__fexp_naninf:
+ sub $0x018,%rsp
+ movaps %xmm0,(%rsi) # save the computed values
+ test $1,%ebx # first value?
+ jz .L__Lni2
+ mov 0(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,0(%rsi) # copy the result
+.L__Lni2:
+ test $2,%ebx # second value?
+ jz .L__Lni3
+ mov 4(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,4(%rsi) # copy the result
+.L__Lni3:
+ test $4,%ebx # third value?
+ jz .L__Lni4
+ mov 8(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,8(%rsi) # copy the result
+.L__Lni4:
+ test $8,%ebx # fourth value?
+ jz .L__Lnie
+ mov 12(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,12(%rsi) # copy the result
+.L__Lnie:
+ movaps (%rsi),%xmm0 # get the answers
+ add $0x018,%rsp
+ ret
+
+#
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects its input in %edx and returns the result in %edx. Destroys %eax.
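+#
+# Illustrative C equivalent (exposition only, not part of the original
+# source).  'bits' is the raw IEEE-754 encoding of the input; the caller
+# only reaches this helper when the exponent field is all ones:
+#
+#   static unsigned int expf_naninf(unsigned int bits)
+#   {
+#       if (bits & 0x007fffffu)            /* mantissa != 0 -> NaN      */
+#           return bits | 0x00400000u;     /* quiet it and hand it back */
+#       if (bits & 0x80000000u)            /* -inf: exp(-inf) = +0      */
+#           return 0u;
+#       return bits;                       /* +inf: exp(+inf) = +inf    */
+#   }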
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
+
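+# Illustrative sketch of the rescue (exposition only, not part of the
+# original source): the product z * 2**m is kept unchanged by trading one
+# power of two from the exponent into the mantissa, so the biased exponent
+# stays representable and the result only becomes +inf when it truly
+# overflows:
+#
+#   #include <math.h>
+#   static float scale_2_to_m(float z, int m)
+#   {
+#       if (m > 127) { m -= 1; z *= 2.0f; }   /* keep exponent in range */
+#       return ldexpf(z, m);
+#   }
+#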
+.L__exp_largef:
+ movdqa %xmm0,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_j(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_j(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+
+ .align 16
+
+.L__exp_largef2:
+ movdqa %xmm6,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm7,p_m2(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf22
+ mov p_m2+0(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+0(%rsp) # save the exponent
+ movss p_j+0(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+0(%rsp) # save the mantissa
+.L__Lf22:
+ test $2,%ecx # second value?
+ jz .L__Lf32
+ mov p_m2+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf32:
+ test $4,%ecx # third value?
+ jz .L__Lf42
+ mov p_m2+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf42:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe2
+ mov p_m2+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe2:
+ movaps p_j(%rsp),%xmm6 # restore the mantissa portion back
+ movdqa p_m2(%rsp),%xmm7 # restore the exponent portion
+ jmp .L__check2
+
+ .data # MUCH better performance without this on my tests
+ .align 64
+.L__real_half: .long 0x03f000000 # 1/2
+ .long 0x03f000000
+ .long 0x03f000000
+ .long 0x03f000000
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32
+ .long 0x03CB17218
+ .long 0x03CB17218
+ .long 0x03CB17218
+.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32
+ .long 0x03CB17000
+ .long 0x03CB17000
+ .long 0x03CB17000
+.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial
+ .long 0x03C088889
+ .long 0x03C088889
+ .long 0x03C088889
+.L__real_infinity: .long 0x07f800000 # infinity
+ .long 0x07f800000
+ .long 0x07f800000
+ .long 0x07f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+.L__two_to_jby32_table:
+ .long 0x03F800000 # 1.0000000000000000
+ .long 0x03F82CD87 # 1.0218971486541166
+ .long 0x03F85AAC3 # 1.0442737824274138
+ .long 0x03F88980F # 1.0671404006768237
+ .long 0x03F8B95C2 # 1.0905077326652577
+ .long 0x03F8EA43A # 1.1143867425958924
+ .long 0x03F91C3D3 # 1.1387886347566916
+ .long 0x03F94F4F0 # 1.1637248587775775
+ .long 0x03F9837F0 # 1.1892071150027210
+ .long 0x03F9B8D3A # 1.2152473599804690
+ .long 0x03F9EF532 # 1.2418578120734840
+ .long 0x03FA27043 # 1.2690509571917332
+ .long 0x03FA5FED7 # 1.2968395546510096
+ .long 0x03FA9A15B # 1.3252366431597413
+ .long 0x03FAD583F # 1.3542555469368927
+ .long 0x03FB123F6 # 1.3839098819638320
+ .long 0x03FB504F3 # 1.4142135623730951
+ .long 0x03FB8FBAF # 1.4451808069770467
+ .long 0x03FBD08A4 # 1.4768261459394993
+ .long 0x03FC12C4D # 1.5091644275934228
+ .long 0x03FC5672A # 1.5422108254079407
+ .long 0x03FC9B9BE # 1.5759808451078865
+ .long 0x03FCE248C # 1.6104903319492543
+ .long 0x03FD2A81E # 1.6457554781539649
+ .long 0x03FD744FD # 1.6817928305074290
+ .long 0x03FDBFBB8 # 1.7186192981224779
+ .long 0x03FE0CCDF # 1.7562521603732995
+ .long 0x03FE5B907 # 1.7947090750031072
+ .long 0x03FEAC0C7 # 1.8340080864093424
+ .long 0x03FEFE4BA # 1.8741676341103000
+ .long 0x03FF5257D # 1.9152065613971474
+ .long 0x03FFA83B3 # 1.9571441241754002
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs8log10f.S b/src/gas/vrs8log10f.S
new file mode 100644
index 0000000..b0a2a67
--- /dev/null
+++ b/src/gas/vrs8log10f.S
@@ -0,0 +1,967 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log10f(__m128 x1, __m128 x2);
+#
+# Computes the base-10 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
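+#
+# Illustrative C sketch of the scalar algorithm this file vectorizes
+# (an approximation for exposition only, not part of the original source;
+# ln_lead[]/ln_tail[] stand for the .L__np_ln_lead_table/.L__np_ln_tail_table
+# data below, with ln_lead[k] + ln_tail[k] ~= ln((k + 64) / 64.0), and the
+# real code also splits log2 and log10(e) into lead/tail parts and handles
+# zero, negative, inf and NaN inputs separately):
+#
+#   #include <math.h>
+#   extern const float ln_lead[65], ln_tail[65];
+#
+#   float log10f_sketch(float x)            /* assumes x finite and > 0 */
+#   {
+#       const float cb1 = 8.3333333e-02f, cb2 = 1.2500000e-02f,
+#                   cb3 = 2.2321981e-03f;   /* .L__real_cb1..cb3        */
+#       int   xexp;
+#       float f  = frexpf(x, &xexp);        /* x = 2**xexp * f, f in [0.5,1) */
+#       int   j  = (int)(f * 128.0f + 0.5f);     /* 64 <= j <= 128      */
+#       float f1 = j * (1.0f / 128.0f);
+#       float f2 = f - f1;
+#       float u  = f2 / (f1 + 0.5f * f2);        /* ln(f/f1) ~ poly(u)  */
+#       float u2 = u * u;
+#       float poly = u + u * u2 * (cb1 + u2 * (cb2 + u2 * cb3));
+#       float lnx  = (xexp - 1) * 0.69314718f
+#                  + ln_lead[j - 64] + ln_tail[j - 64] + poly;
+#       return lnx * 0.43429448f;                /* * log10(e)          */
+#   }
+#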
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_log10f
+ .type __vrs8_log10f,@function
+__vrs8_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# logef to log10f
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ movaps %xmm1,%xmm8
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ movaps %xmm7,%xmm9
+
+ # logef to log10f
+ mulps .L__real_log10e_tail(%rip),%xmm7
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm9
+ mulps .L__real_log10e_lead(%rip),%xmm8
+ addps %xmm7,%xmm1
+ addps %xmm9,%xmm1
+ addps %xmm8,%xmm1
+
+# addps %xmm7,%xmm1
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ #loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm7
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
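+#
+# Illustrative C view of the special cases produced below (exposition only,
+# not part of the original source; mirrors the C99 results):
+#
+#   #include <math.h>
+#   static float log10f_special(float x)
+#   {
+#       if (x != x)    return x + x;       /* NaN in -> quiet NaN out  */
+#       if (x == 0.0f) return -INFINITY;   /* log10(+-0) = -inf        */
+#       if (x <  0.0f) return NAN;         /* negative argument -> NaN */
+#       return x;                          /* not reached on this path */
+#   }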
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrs8log2f.S b/src/gas/vrs8log2f.S
new file mode 100644
index 0000000..d1028b0
--- /dev/null
+++ b/src/gas/vrs8log2f.S
@@ -0,0 +1,956 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log2f(__m128 x1, __m128 x2);
+#
+# Computes the base-2 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
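+#
+# Illustrative sketch of the lead/tail multiply used below (exposition
+# only, not part of the original source).  log2(x) = xexp + ln(f)*log2(e);
+# multiplying by a single rounded log2(e) would lose low-order bits, so the
+# constant is split into .L__real_log2e_lead (upper bits) and
+# .L__real_log2e_tail (the remainder), and the small products are summed
+# before the large one:
+#
+#   /* z1 + z2 ~= ln(f) from the table lookup plus polynomial; xexp is the
+#      integer exponent of x.                                             */
+#   float r1     = z1 * log2e_lead + xexp;
+#   float r2     = (z1 + z2) * log2e_tail + z2 * log2e_lead;
+#   float result = r1 + r2;
+#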
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_log2f
+ .type __vrs8_log2f,@function
+__vrs8_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
+
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm9
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps %xmm7,%xmm10 #z2 copy
+ movaps p_z12(%rsp),%xmm1 # z1 values
+ movaps %xmm1,%xmm11 #z1 copy
+
+ mulps %xmm8,%xmm11 #z1*log2e_lead
+ mulps %xmm8,%xmm7 #z2*log2e_lead
+ mulps %xmm9,%xmm10 #z2*log2e_tail
+ mulps %xmm9,%xmm1 #z1*log2e_tail
+ addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp
+ addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm7,%xmm1 #r2
+ #return r1+r2
+ addps %xmm11,%xmm1 # r1+ r2
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
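+# Net effect (informal): the near-one natural-log result r + r2 is converted
+# to base 2 as (r1 + r2')*log2(e), where r1 is r with its low 16 bits masked
+# off and r2' = r2 + (r - r1); splitting both factors keeps r1*log2e_lead
+# essentially exact in single precision.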
+
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm7
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+ # addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrs8logf.S b/src/gas/vrs8logf.S
new file mode 100644
index 0000000..a5e7ed9
--- /dev/null
+++ b/src/gas/vrs8logf.S
@@ -0,0 +1,904 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8logf.s
+#
+# A vector implementation of the logf libm function.
+#   This routine is implemented in single precision. It is slightly
+#   less accurate than the double-precision version, but it is better
+#   suited to vectorization.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_logf(__m128 x1, __m128 x2);
+#
+# Computes the natural log of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by-4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
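+# Outline of the per-lane computation (informal C-style sketch; the names
+# f, f1, f2, u, z1, z2 follow the inline comments below):
+#   split x into an exponent xexp and a reduced mantissa f (see below)
+#   f1   = index/128                    /* index rounded from the top mantissa bits */
+#   f2   = f - f1
+#   u    = f2 / (f1 + 0.5*f2)
+#   poly = u + A*u^3 + u^5*(B + C*u^2)  /* series approximating ln(f/f1) */
+#   z1   = ln_lead_table[index]
+#   z2   = poly + ln_tail_table[index]
+#   logf(x) ~= (z1 + xexp*log2_lead) + (z2 + xexp*log2_tail)
+#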
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_logf
+ .type __vrs8_logf,@function
+__vrs8_logf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
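+# The table lookups below are a manual gather: p_idx holds four 16-bit indexes
+# packed into one qword (via packssdw above); each index is peeled off with
+# mov %cx,%r8w / ror $16,%rcx, the 32-bit table entry is loaded from
+# -256(%rdx,%r8,4), and pairs of entries are packed back into 64-bit stores.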
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
+
+
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ addps %xmm7,%xmm1
+
+ # check e as a special case
+ movaps p_x2(%rsp),%xmm10
+ cmpps $0,.L__real_ef(%rip),%xmm10
+ movmskps %xmm10,%r9d
+ # check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ # return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrsacosf.S b/src/gas/vrsacosf.S
new file mode 100644
index 0000000..1620009
--- /dev/null
+++ b/src/gas/vrsacosf.S
@@ -0,0 +1,2291 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsacosf.s
+#
+# A vector implementation of the cos libm function.
+#
+# Prototype:
+#
+# vrsa_cosf(int n, float* x, float* y);
+#
+# Computes Cosine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This inlines a routine that computes 4 single precision Cosine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
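+# Illustrative usage (informal sketch):
+#   float x[n], y[n];
+#   vrsa_cosf(n, x, y);      /* afterwards y[i] ~= cosf(x[i]) for 0 <= i < n */
+#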
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+
+.align 8
+ .Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 * ; Done
+ .quad .Lcoscos_cossin_piby4 # 1 + ; Done
+ .quad .Lcoscos_sincos_piby4 # 2 ; Done
+ .quad .Lcoscos_sinsin_piby4 # 3 + ; Done
+
+ .quad .Lcossin_coscos_piby4 # 4 ; Done
+ .quad .Lcossin_cossin_piby4 # 5 * ; Done
+ .quad .Lcossin_sincos_piby4 # 6 ; Done
+ .quad .Lcossin_sinsin_piby4 # 7 ; Done
+
+ .quad .Lsincos_coscos_piby4 # 8 ; Done
+ .quad .Lsincos_cossin_piby4 # 9 ; TBD
+ .quad .Lsincos_sincos_piby4 # 10 * ; Done
+ .quad .Lsincos_sinsin_piby4 # 11 ; Done
+
+ .quad .Lsinsin_coscos_piby4 # 12 ; Done
+ .quad .Lsinsin_cossin_piby4 # 13 + ; Done
+ .quad .Lsinsin_sincos_piby4 # 14 ; Done
+ .quad .Lsinsin_sinsin_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_cosf_
+ .set vrsa_cosf_,__vrsa_cosf__
+ .weak vrsa_cosf__
+ .set vrsa_cosf__,__vrsa_cosf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array cos
+#VRSA_COSF(N,X,Y)
+#C equivalent*/
+#void vrsa_cosf__(int *n, float *x, float *y)
+#{
+# vrsa_cosf(*n,x,y);
+#}
+
+.globl __vrsa_cosf__
+ .type __vrsa_cosf__,@function
+__vrsa_cosf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # original x
+.equ p_sign,0x0180 # original x
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # original x
+.equ p_sign1,0x01B0 # original x
+
+.equ save_r12,0x01C0 # temporary for get/put bits operation
+.equ save_r13,0x01D0 # temporary for get/put bits operation
+
+.equ save_xa,0x01E0 #qword
+.equ save_ya,0x01F0 #qword
+
+.equ save_nv,0x0200 #qword
+.equ p_iter,0x0210 # qword storage for number of loop iterations
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl vrsa_cosf
+ .type vrsa_cosf,@function
+vrsa_cosf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+
+ sub $0x0228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+
+
+
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
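+# In C terms, the bookkeeping above is (informal sketch):
+#   iterations = n >> 2;                  /* four elements per main-loop pass */
+#   leftover   = n - (iterations << 2);   /* 0..3 values for the cleanup path */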
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# V4 START
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
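+
+# In C terms, the reduction above is a Cody-Waite style splitting of pi/2
+# into the non-overlapping parts piby2_1, piby2_2 and piby2_2tail
+# (a sketch following the comments; variable names are illustrative):
+#
+#	npi2  = (int)(x * twobypi + 0.5);
+#	rhead = x - npi2 * piby2_1;
+#	rtail = npi2 * piby2_2;
+#	t     = rhead;
+#	rhead = t - rtail;
+#	rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#	r     = rhead - rtail;		/* extra-precision remainder */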
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
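+
+# A plausible reading of the sign logic above: for cos(x) the result is
+# negative exactly when (region ^ (region >> 1)) & 1 is set, i.e. for
+# regions 1 and 2 mod 4. That bit is computed per lane, shifted into the
+# sign-bit position of each double, and stored in p_sign/p_sign1 to be
+# XORed into the results in the cleanup code.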
+
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
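+
+# The table index built above packs the low (odd/even) bit of each of the
+# four regions into bits 0..3; a plausible C reading (idx is illustrative):
+#
+#	idx = (region0_lo & 1) | ((region0_hi & 1) << 1)
+#	    | ((region1_lo & 1) << 2) | ((region1_hi & 1) << 3);
+#	goto *evencos_oddsin_tbl[idx];	/* even region -> cos poly, odd -> sin */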
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm0, xmm12
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
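+
+# Note: ORing in 0x0008000000000000 sets the quiet-NaN bit of the stored
+# double, so a signalling NaN input comes back as a quiet NaN without
+# calling the reduction routine; region is forced to 0 (r10d is zero here),
+# which selects the even/cos path for that lane.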
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm0, xmm12
+#         xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r = (rhead-rtail)
+
+	movlpd	 %xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+#         xmm5, xmm8, xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+	movlpd	 %xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_cosf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+# NEW
+
+.L__vrsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlps %xmm0,(%rdi)
+ movhps %xmm0,8(%rdi)
+
+ prefetch 32(%rdi)
+ add $16,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+# NEW
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0228,%rsp
+ ret
+
+#NEW
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when there are one to three cos calls left to make at the end.
+# save_xa and save_ya point at the next x and y array elements;
+# the number of values left is in save_nv
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+
+# fill an __m128 with zeroes and the remaining values, then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrsa_cosf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
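+
+# In rough C terms the cleanup above does the following (xin/yout stand for
+# the p_temp/p_temp2 scratch slots and are illustrative names only):
+#
+#	float xin[4] = {0.0f}, yout[4];
+#	for (i = 0; i < leftover; i++) xin[i] = x[i];
+#	vrsa_cosf(4, xin, yout);	/* recurse on a zero-padded group of 4 */
+#	for (i = 0; i < leftover; i++) y[i] = yout[i];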
+
+#NEW
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrsa_cosf_cleanup
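+
+# The core approximation evaluated above, in the notation of the comments
+# (coefficients c1..c4 live in .Lcosarray; a sketch, not an extra code path):
+#
+#	t      = 1.0 - 0.5*x2;
+#	zc     = (c1 + x2*c2) + x4*(c3 + x2*c4);
+#	cos(x) ~= t + x4*zc;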
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+	movhlps	%xmm5,%xmm9				# xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrsa_cosf_cleanup
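+
+# Per the comments above, after the movhlps split the low scalar lane of
+# each pair carries the sin series and the high lane the cos series
+# (a sketch in the notation of the comments):
+#
+#	sin(r) ~= r + r3*((s1 + x2*s2) + x4*(s3 + x2*s4));
+#	cos(r) ~= t + x4*((c1 + x2*c2) + x4*(c3 + x2*c4)),  t = 1 - 0.5*x2;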
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+	movhlps	%xmm4,%xmm12				# xmm12 = sin , xmm4 = cos
+	movhlps	%xmm5,%xmm13				# xmm13 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_cosf_cleanup
diff --git a/src/gas/vrsaexpf.S b/src/gas/vrsaexpf.S
new file mode 100644
index 0000000..399943e
--- /dev/null
+++ b/src/gas/vrsaexpf.S
@@ -0,0 +1,766 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsaexpf.s
+#
+# An array implementation of the expf libm function.
+#
+# Prototype:
+#
+# void vrsa_expf(int n, float *x, float *y);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double-precision version, but it is
+# better suited to vectorization.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
+#
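+# As a reading aid, here is a minimal scalar C sketch of the same
+# reduction/reconstruction (steps 1-3 below).  It is not part of this file;
+# the function name is illustrative, exp2f/ldexpf stand in for the 2^(j/32)
+# lookup table and the exponent-building code, and the real code splits
+# ln(2)/32 into head and tail constants for extra accuracy.
+#
+# #include <math.h>
+# float expf_sketch(float x)
+# {
+#     /* Step 1: n = nearest integer to x * 32/ln(2); split n into m and j */
+#     float c = logf(2.0f) / 32.0f;
+#     int   n = (int)lrintf(x / c);
+#     int   j = n & 0x1f;
+#     int   m = (n - j) / 32;
+#     float r = x - (float)n * c;          /* reduced argument             */
+#     /* Step 2: q = r + r^2/2 + r^3/6 + r^4/24                            */
+#     float q = r + r*r*0.5f + r*r*r*(1.0f/6.0f) + r*r*r*r*(1.0f/24.0f);
+#     /* Step 3: reconstitute, exp(x) = 2^m * 2^(j/32) * (1 + q)           */
+#     float f1 = exp2f((float)j / 32.0f);  /* stands in for the table      */
+#     return ldexpf(f1 + f1 * q, m);       /* scale by 2^m                 */
+# }
+#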
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_ux,0x00 #qword
+.equ p_ux2,0x010 #qword
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p_j,0x040 # second temporary for get/put bits operation
+.equ p_m,0x050 #qword
+.equ p_j2,0x060 # second temporary for exponent multiply
+.equ p_m2,0x070 #qword
+.equ save_rbx,0x080 #qword
+
+
+.equ stack_size,0x098
+
+ .weak vrsa_expf_
+ .set vrsa_expf_,__vrsa_expf__
+ .weak vrsa_expf__
+ .set vrsa_expf__,__vrsa_expf__
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array expf
+#** VRSA_EXPF(N,X,Y)
+# C equivalent*/
+#void vrsa_expf__(int * n, float *x, float *y)
+#{
+# vrsa_expf(*n,x,y);
+#}
+.globl __vrsa_expf__
+ .type __vrsa_expf__,@function
+__vrsa_expf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrsa_expf
+ .type vrsa_expf,@function
+vrsa_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm6
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0
+ movaps %xmm6,p_ux2(%rsp)
+ maxps .L__real_m8192(%rip),%xmm6
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 #
+
+ mulps %xmm6,%xmm5
+ minps .L__real_8192(%rip),%xmm5 # protect against large input values
+
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+ cvtps2dq %xmm5,%xmm8
+ cvtdq2ps %xmm8,%xmm7
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+ movaps .L__real_log2_by_32_head(%rip),%xmm5
+ mulps %xmm7,%xmm5
+ subps %xmm5,%xmm6 # r1 in xmm6,
+
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+ mulps .L__real_log2_by_32_tail(%rip),%xmm7
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ movdqa %xmm8,%xmm9
+ movdqa .L__int_mask_1f(%rip),%xmm5
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+ pand %xmm9,%xmm5
+ movdqa %xmm5,p_j2(%rsp)
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+ psubd %xmm5,%xmm9
+ psrad $5,%xmm9
+ movdqa %xmm9,p_m2(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3 # r = r1+ r2
+
+ mov p_j(%rsp),%eax # get an individual index
+ movaps %xmm6,%xmm8
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ addps %xmm7,%xmm8 # r = r1+ r2
+ mov %edx,p_j(%rsp) # save the f1 value
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2
+ mulps %xmm2,%xmm2 # x*x
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mov p_j+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+ mulps %xmm3,%xmm4 # *x^3
+ mov p_j2(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2(%rsp) # save the f1 value
+
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+ movaps %xmm8,%xmm9
+ mov p_j2+4(%rsp),%eax # get an individual index
+ movaps %xmm8,%xmm5
+ mulps %xmm5,%xmm5 # x*x
+ mulps .L__real_1_24(%rip),%xmm9 # /24
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+4(%rsp) # save the f1 value
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
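+
+# A scalar sketch of what the pminsw/paddd/pmaxsw/pslld sequence above
+# builds: clamp the biased exponent to [0,255] and move it into the
+# exponent field of a float.  The helper name is illustrative only.
+#
+# #include <stdint.h>
+# static float two_to_n_sketch(int n)
+# {
+#     int biased = n + 127;                    /* add bias                  */
+#     if (biased < 0)   biased = 0;            /* floor at 0 -> result 0.0  */
+#     if (biased > 255) biased = 255;          /* ceiling -> +inf pattern   */
+#     union { uint32_t u; float f; } v = { (uint32_t)biased << 23 };
+#     return v.f;                              /* 2^n (0 or inf at limits)  */
+# }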
+
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%ebx
+ test $0x0f,%ebx
+
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+.L__vsa_bottom1:
+
+ # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ addps .L__real_1_6(%rip),%xmm9 # +1/6
+
+ mulps %xmm5,%xmm8 # x^3
+ mov p_j2+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm5 # x^2/2
+ mulps %xmm8,%xmm9 # *x^3
+
+ mov p_j2+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+12(%rsp) # save the f1 value
+ addps %xmm9,%xmm7 # +r2
+
+ addps %xmm5,%xmm7 # + x^2/2
+ addps %xmm7,%xmm6 # +r1
+
+
+ # deal with infinite or denormal results
+ movdqa p_m2(%rsp),%xmm7
+ movdqa p_m2(%rsp),%xmm5
+ pcmpgtd .L__int_127(%rip),%xmm5
+ pminsw .L__int_128(%rip),%xmm7 # ceil at 128
+ movmskps %xmm5,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm7 # add bias
+
+ # *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j2(%rsp),%xmm6 # * f1
+ addps p_j2(%rsp),%xmm6 # + f1
+ jnz .L__exp_largef2
+.L__check2:
+
+ pxor %xmm5,%xmm5 # floor at 0
+ pmaxsw %xmm5,%xmm7
+
+ pslld $23,%xmm7 # build 2^n
+
+ movaps %xmm7,%xmm5
+
+
+ # check for infinity or nan
+ movaps p_ux2(%rsp),%xmm7
+ andps .L__real_infinity(%rip),%xmm7
+ cmpps $0,.L__real_infinity(%rip),%xmm7
+ movmskps %xmm7,%ebx
+ test $0x0f,%ebx
+
+
+ # end of splitexp
+ # /* Scale (z1 + z2) by 2.0**m */
+ # Step 3. Reconstitute.
+
+ mulps %xmm5,%xmm6 # result *= 2^n
+#__vsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+ jnz .L__exp_naninf2
+
+.L__vsa_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm6,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_ux(%rsp),%rcx
+ call .L__fexp_naninf
+ jmp .L__vsa_bottom1
+.L__exp_naninf2:
+ lea p_ux2(%rsp),%rcx
+ movaps %xmm6,%xmm0
+ call .L__fexp_naninf
+ movaps %xmm0,%xmm6
+ jmp .L__vsa_bottom2
+
+# deal with nans and infinities
+# This subroutine checks a packed single for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# rbx - mask of errors
+# xmm0 - computed result vector
+# Outputs:
+# xmm0 - new result vector
+# %rax, %rdx, %rbx, %xmm2 all modified.
+
+.L__fexp_naninf:
+ movaps %xmm0,p_j+8(%rsp) # save the computed values
+ test $1,%ebx # first value?
+ jz .L__Lni2
+ mov 0(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+8(%rsp) # copy the result
+.L__Lni2:
+ test $2,%ebx # second value?
+ jz .L__Lni3
+ mov 4(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+12(%rsp) # copy the result
+.L__Lni3:
+ test $4,%ebx # third value?
+ jz .L__Lni4
+ mov 8(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+16(%rsp) # copy the result
+.L__Lni4:
+ test $8,%ebx # fourth value?
+ jz .L__Lnie
+ mov 12(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+20(%rsp) # copy the result
+.L__Lnie:
+ movaps p_j+8(%rsp),%xmm0 # get the answers
+ ret
+
+#
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects the input in %edx, and returns the value in %edx. Destroys %eax.
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
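+
+# A scalar sketch of the rule implemented by .L__naninf above, assuming the
+# input bit pattern has already been screened to be an infinity or a NaN
+# (the caller masks against the infinity exponent).  The helper name is
+# illustrative only.
+#
+# #include <stdint.h>
+# static uint32_t naninf_sketch(uint32_t ux)   /* ux = input bit pattern */
+# {
+#     if (ux & 0x007FFFFF)              /* non-zero mantissa: a NaN      */
+#         return ux | 0x00400000;       /*   return it quieted           */
+#     return (ux & 0x80000000) ? 0u     /* exp(-inf) = 0                 */
+#                              : ux;    /* exp(+inf) = +inf              */
+# }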
+
+
+ .align 16
+# we jump here when we have an odd number of exp calls to make at the
+# end
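+#
+# The tail handling below pads the 1..7 leftover inputs into a zeroed
+# 8-wide temporary, makes one recursive full-width call, and copies back
+# only the lanes that were real.  A C sketch with illustrative names:
+#
+# void vrsa_expf(int n, float *x, float *y);
+# static void expf_tail_sketch(int nleft, float *x, float *y)
+# {
+#     float tin[8] = {0.0f}, tout[8];
+#     for (int i = 0; i < nleft; i++) tin[i] = x[i];
+#     vrsa_expf(8, tin, tout);          /* recursive call on padded data */
+#     for (int i = 0; i < nleft; i++) y[i] = tout[i];
+# }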
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p_j(%rsp)
+ movaps %xmm0,p_j+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_j(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_j+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_j+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p_j+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p_j+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p_j+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p_j+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p_j(%rsp),%rsi # &x parameter
+ lea p_j2(%rsp),%rdx # &y parameter
+	call	vrsa_expf@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_j2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p_j2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p_j2+8(%rsp),%ecx
+	mov	%ecx,8(%rdi)	# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p_j2+12(%rsp),%ecx
+	mov	%ecx,12(%rdi)	# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p_j2+16(%rsp),%ecx
+	mov	%ecx,16(%rdi)	# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p_j2+20(%rsp),%ecx
+	mov	%ecx,20(%rdi)	# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p_j2+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
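+#
+# A scalar sketch of that fix-up: halve the exponent's contribution and
+# double the mantissa term so the later scale by 2^m cannot overflow
+# prematurely.  The helper name is illustrative; the real code does this
+# per lane under the error mask.
+#
+# #include <math.h>
+# static float scale_large_m_sketch(float z, int m)
+# {
+#     if (m > 127) {       /* would saturate the biased exponent */
+#         m -= 1;          /* scale the exponent term down       */
+#         z *= 2.0f;       /* and the mantissa term up           */
+#     }
+#     return ldexpf(z, m);
+# }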
+
+.L__exp_largef:
+ movdqa %xmm0,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_j(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_j(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+ .align 16
+
+.L__exp_largef2:
+ movdqa %xmm6,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm7,p_m2(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf22
+ mov p_m2+0(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+0(%rsp) # save the exponent
+ movss p_j+0(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+0(%rsp) # save the mantissa
+.L__Lf22:
+ test $2,%ecx # second value?
+ jz .L__Lf32
+ mov p_m2+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf32:
+ test $4,%ecx # third value?
+ jz .L__Lf42
+ mov p_m2+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf42:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe2
+ mov p_m2+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe2:
+ movaps p_j(%rsp),%xmm6 # restore the mantissa portion back
+ movdqa p_m2(%rsp),%xmm7 # restore the exponent portion
+ jmp .L__check2
+
+
+ .data
+ .align 64
+.L__real_half: .long 0x03f000000 # 1/2
+ .long 0x03f000000
+ .long 0x03f000000
+ .long 0x03f000000
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32
+ .long 0x03CB17218
+ .long 0x03CB17218
+ .long 0x03CB17218
+.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32
+ .long 0x03CB17000
+ .long 0x03CB17000
+ .long 0x03CB17000
+.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial
+ .long 0x03C088889
+ .long 0x03C088889
+ .long 0x03C088889
+.L__real_infinity: .long 0x07f800000 # infinity
+ .long 0x07f800000
+ .long 0x07f800000
+ .long 0x07f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+.L__two_to_jby32_table:
+ .long 0x03F800000 # 1.0000000000000000
+ .long 0x03F82CD87 # 1.0218971486541166
+ .long 0x03F85AAC3 # 1.0442737824274138
+ .long 0x03F88980F # 1.0671404006768237
+ .long 0x03F8B95C2 # 1.0905077326652577
+ .long 0x03F8EA43A # 1.1143867425958924
+ .long 0x03F91C3D3 # 1.1387886347566916
+ .long 0x03F94F4F0 # 1.1637248587775775
+ .long 0x03F9837F0 # 1.1892071150027210
+ .long 0x03F9B8D3A # 1.2152473599804690
+ .long 0x03F9EF532 # 1.2418578120734840
+ .long 0x03FA27043 # 1.2690509571917332
+ .long 0x03FA5FED7 # 1.2968395546510096
+ .long 0x03FA9A15B # 1.3252366431597413
+ .long 0x03FAD583F # 1.3542555469368927
+ .long 0x03FB123F6 # 1.3839098819638320
+ .long 0x03FB504F3 # 1.4142135623730951
+ .long 0x03FB8FBAF # 1.4451808069770467
+ .long 0x03FBD08A4 # 1.4768261459394993
+ .long 0x03FC12C4D # 1.5091644275934228
+ .long 0x03FC5672A # 1.5422108254079407
+ .long 0x03FC9B9BE # 1.5759808451078865
+ .long 0x03FCE248C # 1.6104903319492543
+ .long 0x03FD2A81E # 1.6457554781539649
+ .long 0x03FD744FD # 1.6817928305074290
+ .long 0x03FDBFBB8 # 1.7186192981224779
+ .long 0x03FE0CCDF # 1.7562521603732995
+ .long 0x03FE5B907 # 1.7947090750031072
+ .long 0x03FEAC0C7 # 1.8340080864093424
+ .long 0x03FEFE4BA # 1.8741676341103000
+ .long 0x03FF5257D # 1.9152065613971474
+ .long 0x03FFA83B3 # 1.9571441241754002
+ .long 0 # for alignment
+
+
+
diff --git a/src/gas/vrsalog10f.S b/src/gas/vrsalog10f.S
new file mode 100644
index 0000000..003eaf1
--- /dev/null
+++ b/src/gas/vrsalog10f.S
@@ -0,0 +1,1149 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog10f.s
+#
+# An array implementation of the log10f libm function.
+#
+# Prototype:
+#
+# void vrsa_log10f(int n, float *x, float *y);
+#
+# Computes the base-10 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
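+# A minimal scalar C sketch of the decomposition used below:
+# x = 2^xexp * f, f1 = (top mantissa bits)/128, u = 2*(f - f1)/(f + f1),
+# ln(x) = xexp*ln2 + ln(f1) + ln(f/f1), and finally
+# log10(x) = ln(x) * log10(e).  This is illustrative only: logf() stands
+# in for the lead/tail lookup tables, the polynomial keeps just its
+# leading term, and the real code splits ln2 and log10(e) into lead and
+# tail parts and handles zero/negative/inf/NaN inputs separately.
+#
+# #include <math.h>
+# float log10f_sketch(float x)      /* assumes x is positive and normal */
+# {
+#     int   xexp;
+#     float f  = 2.0f * frexpf(x, &xexp);      /* f in [1,2)             */
+#     xexp    -= 1;
+#     float f1 = (float)(int)(f * 128.0f + 0.5f) / 128.0f;
+#     float f2 = f - f1;
+#     float u  = f2 / (f1 + 0.5f * f2);        /* = 2*(f-f1)/(f+f1)      */
+#     float poly = u + u*u*u*(1.0f/12.0f);     /* ~ ln(f/f1)             */
+#     float lnx  = xexp * 0.69314718f + logf(f1) + poly;
+#     return lnx * 0.43429448f;                /* times log10(e)         */
+# }
+#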
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .weak vrsa_log10f_
+ .set vrsa_log10f_,__vrsa_log10f__
+ .weak vrsa_log10f__
+ .set vrsa_log10f__,__vrsa_log10f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log10f
+#** VRSA_LOG10F(N,X,Y)
+# C equivalent*/
+#void vrsa_log10f__(int * n, float *x, float *y)
+#{
+# vrsa_log10f(*n,x,y);
+#}
+.globl __vrsa_log10f__
+ .type __vrsa_log10f__,@function
+__vrsa_log10f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+.globl vrsa_log10f
+ .type vrsa_log10f,@function
+vrsa_log10f:
+ sub $stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# loge to log10
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+
+
+# check for e
+# test $0x0f,%r9d
+ # jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x ?  (also catches NaNs, which compare false)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ movaps %xmm1,%xmm8
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ movaps %xmm7,%xmm9
+	# loge to log10
+ mulps .L__real_log10e_tail(%rip),%xmm7
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm9
+ mulps .L__real_log10e_lead(%rip),%xmm8
+ addps %xmm7,%xmm1
+ addps %xmm9,%xmm1
+ addps %xmm8,%xmm1
+# addps %xmm7,%xmm1
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# 0 < x ?  (also catches NaNs, which compare false)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_log10f@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	%ecx,8(%rdi)	# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	%ecx,12(%rdi)	# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	%ecx,16(%rdi)	# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	%ecx,20(%rdi)	# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+# loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ #loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm7
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:			.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 			.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:			.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsalog2f.S b/src/gas/vrsalog2f.S
new file mode 100644
index 0000000..9760d9f
--- /dev/null
+++ b/src/gas/vrsalog2f.S
@@ -0,0 +1,1140 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog2f.s
+#
+# An array implementation of the log2f libm function.
+#
+# Prototype:
+#
+# void vrsa_log2f(int n, float *x, float *y);
+#
+# Computes the base-2 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_log2f_
+ .set vrsa_log2f_,__vrsa_log2f__
+ .weak vrsa_log2f__
+ .set vrsa_log2f__,__vrsa_log2f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log2f
+#** VRSA_LOG2F(N,X,Y)
+# C equivalent*/
+#void vrsa_log2f__(int * n, float *x, float *y)
+#{
+# vrsa_log2f(*n,x,y);
+#}
+.globl __vrsa_log2f__
+ .type __vrsa_log2f__,@function
+__vrsa_log2f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+.globl vrsa_log2f
+ .type vrsa_log2f,@function
+vrsa_log2f:
+ sub $stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
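+
+# (At this point xmm6/xmm13 hold xexp, the unbiased exponent of each input,
+#  and the packed 7-bit indexes spilled to p_idx/p_idx2 select f1 = index/128,
+#  i.e. the reduced mantissa f rounded to the nearest 1/128 step; the scalar
+#  code below uses these indexes for the ln lead/tail table lookups.)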
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
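+
+# (u = f2 / (f1 + 0.5*f2) = 2*(f - f1)/(f + f1), the symmetric reduction
+#  variable used by the ln(f/f1) series further down.)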
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
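+
+# (poly = u + cb1*u^3 + cb2*u^5 + cb3*u^7 with cb1..cb3 close to 1/12, 1/80
+#  and 1/448, i.e. the series for ln((1 + u/2)/(1 - u/2)) = ln(f/f1).)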
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
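+
+# (Final combine for log2: log2(x) = xexp + (z1 + z2)*log2(e), where z1 is
+#  the lead-table ln value and z2 = poly + tail-table correction.  log2(e)
+#  is kept split as lead + tail so the large terms (r1) and the small
+#  corrections (r2) can be accumulated separately before the last add.)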
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 >= x ?  (catches NaNs too)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+### if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm9
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps %xmm7,%xmm10 #z2 copy
+ movaps p_z12(%rsp),%xmm1 # z1 values
+ movaps %xmm1,%xmm11 #z1 copy
+
+ mulps %xmm8,%xmm11 #z1*log2e_lead
+ mulps %xmm8,%xmm7 #z2*log2e_lead
+ mulps %xmm9,%xmm10 #z2*log2e_tail
+ mulps %xmm9,%xmm1 #z1*log2e_tail
+ addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp
+ addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm7,%xmm1 #r2
+ #return r1+r2
+ addps %xmm11,%xmm1 # r1+ r2
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7  # 0 >= x ?  (catches NaNs too)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ### if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_log2f@PLT	# call recursively to compute eight values
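+# (p2_temp was zero-filled above, so the unused lanes of this eight-wide
+#  call just compute log2f(0); those extra results are never copied back
+#  out below, so the zero padding is harmless.)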
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	 %ecx,8(%rdi)			# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	 %ecx,12(%rdi)			# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	 %ecx,16(%rdi)			# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	 %ecx,20(%rdi)			# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
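+
+# (This converts the near-1 natural-log result to log2: r is split into a
+#  truncated head (lower 16 mantissa bits cleared) plus a remainder folded
+#  into r2, and each piece is multiplied by the lead/tail parts of log2(e),
+#  summing the small products first to preserve precision.)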
+
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm7
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsalogf.S b/src/gas/vrsalogf.S
new file mode 100644
index 0000000..1f96523
--- /dev/null
+++ b/src/gas/vrsalogf.S
@@ -0,0 +1,1088 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalogf.s
+#
+# An array implementation of the logf libm function.
+#
+# Prototype:
+#
+# void vrsa_logf(int n, float *x, float *y);
+#
+# Computes the natural log of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_logf_
+ .set vrsa_logf_,__vrsa_logf__
+ .weak vrsa_logf__
+ .set vrsa_logf__,__vrsa_logf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array logf
+#** VRSA_LOGF(N,X,Y)
+# C equivalent*/
+#void vrsa_logf__(int * n, float *x, float *y)
+#{
+# vrsa_logf(*n,x,y);
+#}
+.globl __vrsa_logf__
+ .type __vrsa_logf__,@function
+__vrsa_logf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+.globl vrsa_logf
+ .type vrsa_logf,@function
+vrsa_logf:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
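+
+# (Natural-log combine: result = (z1 + xexp*log2_lead) + (z2 + xexp*log2_tail),
+#  where log2_lead + log2_tail = ln(2), split into a short lead part and a
+#  tail correction for extra precision.)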
+
+
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 >= x ?  (catches NaNs too)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ addps %xmm7,%xmm1
+
+ # check e as a special case
+ movaps p_x2(%rsp),%xmm10
+ cmpps $0,.L__real_ef(%rip),%xmm10
+ movmskps %xmm10,%r9d
+ # check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7  # 0 >= x ?  (catches NaNs too)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_logf@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	 %ecx,8(%rdi)			# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	 %ecx,12(%rdi)			# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	 %ecx,16(%rdi)			# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	 %ecx,20(%rdi)			# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ # return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsapowf.S b/src/gas/vrsapowf.S
new file mode 100644
index 0000000..3521a6b
--- /dev/null
+++ b/src/gas/vrsapowf.S
@@ -0,0 +1,782 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsapowf.s
+#
+# An array implementation of the powf libm function.
+#
+# Prototype:
+#
+# void vrsa_powf(int n, float *x, float *y, float *z);
+#
+# Computes x raised to the y power.
+#
+# Places the results into the supplied z array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
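+#
+# A minimal C usage sketch (illustrative only; the array contents are hypothetical):
+#
+#   float x[8], y[8], z[8];
+#   /* ... fill x and y ... */
+#   vrsa_powf(8, x, y, z);    /* z[i] = powf(x[i], y[i]) for i = 0..7 */
+#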
+
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+
+.equ save_rbx,0x030 #qword
+
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicators
+
+.equ p_xptr,0x0a0 # ptr to x values
+.equ p_yptr,0x0a8 # ptr to y values
+.equ p_zptr,0x0b0 # ptr to z values
+
+.equ p_nv,0x0b8 #qword
+.equ p_iter,0x0c0 # qword storage for number of loop iterations
+
+.equ p2_temp,0x0d0 #qword
+.equ p2_temp1,0x0f0 #qword
+
+.equ stack_size,0x0118 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+
+ .weak vrsa_powf_
+ .set vrsa_powf_,__vrsa_powf__
+ .weak vrsa_powf__
+ .set vrsa_powf__,__vrsa_powf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array powf
+#** VRSA_POWF(N,X,Y,Z)
+#** C equivalent
+#*/
+#void vrsa_powf_(int * n, float *x, float *y, float *z)
+#{
+# vrsa_powf(*n,x,y,z);
+#}
+
+.globl __vrsa_powf__
+ .type __vrsa_powf__,@function
+__vrsa_powf__:
+ mov (%rdi),%edi
+
+
+# parameters are passed in by Linux as:
+# edi - int n
+# rsi - float *x
+# rdx - float *y
+# rcx - float *z
+
+.globl vrsa_powf
+ .type vrsa_powf,@function
+vrsa_powf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+# save the arguments
+ mov %rsi,p_xptr(%rsp) # save pointer to x
+ mov %rdx,p_yptr(%rsp) # save pointer to y
+ mov %rcx,p_zptr(%rsp) # save pointer to z
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+#endif
+
+ mov %rax,%rcx
+ mov %rcx,p_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rcx # compute number of extra single calls
+ mov %rcx,p_nv(%rsp) # save number of left over values
+
+# process the array 4 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+# first get x
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ prefetch 64(%rsi)
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
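+# A scalar C sketch of the same classification (it mirrors the scalar code in
+# vrsapowxf.S below; uy is the IEEE-754 bit pattern of y, names are illustrative):
+#
+#   int yexp = (int)((uy & 0x7f800000u) >> 23) - 126;   /* unbiased exponent + 1 */
+#   if (yexp < 1)        inty = 0;                      /* |y| < 1.0, not integer */
+#   else if (yexp > 24)  inty = 2;                      /* no fractional bits left */
+#   else {
+#       unsigned mask = (1u << (24 - yexp)) - 1;
+#       if (uy & mask)                              inty = 0;
+#       else if (((uy & ~mask) >> (24 - yexp)) & 1) inty = 1;
+#       else                                        inty = 2;
+#   }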
+ mov p_yptr(%rsp),%rdi # get y_array pointer
+ movups (%rdi),%xmm4
+ prefetch 64(%rdi)
+ pxor %xmm3,%xmm3
+ pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format
+ movdqa %xmm4,p_ay(%rsp) # save it
+
+# see if the number is less than 1.0
+ psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32
+
+ psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent
+ movdqa %xmm4,p_yexp(%rsp) # save it
+ paddd .L__mask_1(%rip),%xmm4 # yexp+1
+ pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs
+# xmm4 is ffs if abs(y) >=1.0, else 0
+
+# see if the mantissa has fractional bits
+#build mask for mantissa
+ movdqa .L__mask_23(%rip),%xmm2
+ psubd p_yexp(%rsp),%xmm2 # 24-yexp
+ pmaxsw %xmm3,%xmm2 # no shift counts less than 0
+ movdqa %xmm2,p_temp(%rsp) # save the shift counts
+# create mask for all four values
+# SSE can't do individual shifts so we have to do each one separately
+ mov p_temp(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rax,%rbx
+ mov %rbx,p_temp(%rsp)
+ mov p_temp+8(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rbx,%rax
+ mov %rax,p_temp+8(%rsp)
+ movdqa p_temp(%rsp),%xmm5
+ psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1
+
+# now use the mask to see if there are any fractional bits
+ movdqu (%rdi),%xmm2 # get uy
+ pand %xmm5,%xmm2 # uy & mask
+ pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs
+ pand %xmm4,%xmm2 # either 0s or ff
+# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits,
+# it has the value 0 if we know it's non-integer or ff if integer.
+
+# now see if it's even or odd.
+
+## if yexp > 24, then it has to be even
+ movdqa .L__mask_24(%rip),%xmm4
+ psubd p_yexp(%rsp),%xmm4 # 24-yexp
+ paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit
+ pcmpgtd %xmm3,%xmm4 # if 0, then must be even, else ff's
+
+ pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24
+ paddd .L__mask_2(%rip),%xmm4
+ por .L__mask_2(%rip),%xmm4
+ pand %xmm2,%xmm4 # result can be 0, 2, or 3
+
+# now for integer numbers, see if odd or even
+ pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits
+ movdqu (%rdi),%xmm2
+ pand %xmm2,%xmm5 # & uy -> even or odd
+ movdqa .L__float_one(%rip),%xmm2
+ pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd
+ pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works.
+ por %xmm2,%xmm5
+ pcmpgtd %xmm3,%xmm5 # if odd then ff's, else 0's for even
+ paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd
+ pand %xmm5,%xmm4
+
+ movdqa %xmm4,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ movdqa %xmm4,%xmm5
+ pcmpeqd %xmm3,%xmm5 # is not an integer? ff's if so
+ pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0
+ movdqa %xmm4,%xmm2
+ pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so
+ pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set
+ por %xmm2,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
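+# Roughly, in scalar C terms (a sketch of the idea, not the exact code path):
+#   double w = (double)y * log((double)fabsf(x));   /* log and multiply in double */
+#   float  r = (float)exp(w);                        /* then apply p_negateres     */
+#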
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+# convert all four y's to double
+# mov p_yptr(%rsp),%rdi ; get y_array pointer
+ cvtps2pd (%rdi),%xmm2
+ cvtps2pd 8(%rdi),%xmm3
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm3,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ mov p_yptr(%rsp),%rdi # get y_array pointer
+#
+# convert all four results back to single
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ movdqa p_ay(%rsp),%xmm4
+ cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movdqu (%rdi),%xmm4 # get y
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should
+ # be false, unless y is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqu (%rsi),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm3 # one
+ xorps %xmm2,%xmm2
+ cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+## if x == +1, return +1 for all x
+ movdqa %xmm3,%xmm2
+ movdqu (%rsi),%xmm5
+ cmpps $4,%xmm5,%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+
+# update the x and y pointers
+ add $16,%rdi
+ add $16,%rsi
+ mov %rsi,p_xptr(%rsp) # save x_array pointer
+ mov %rdi,p_yptr(%rsp) # save y_array pointer
+# store the result _m128d
+ mov p_zptr(%rsp),%rdi # get z_array pointer
+ movups %xmm0,(%rdi)
+# prefetchw QWORD PTR [rdi+64]
+ prefetch 64(%rdi)
+ add $16,%rdi
+ mov %rdi,p_zptr(%rsp) # save z_array pointer
+
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# we jump here when there are one to three leftover values to process at the
+# end
+.L__vsa_cleanup:
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov p_xptr(%rsp),%rsi
+ mov p_yptr(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ mov (%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+16(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ mov 4(%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+20(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ mov 8(%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp+16(%rsp),%rdx # &y parameter
+ lea p2_temp1(%rsp),%rcx # &z parameter
+ call vrsa_powf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov p_zptr(%rsp),%rdi
+ mov p_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+ .align 16
+# y is a NaN.
+.Ly_NaN:
+ mov p_yptr(%rsp),%rdx # get pointer to y
+ movdqu (%rdx),%xmm4 # get y
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of y to itself should
+ # be true, unless y is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqu (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lylrga
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lylrga:
+ test $2,%edx
+ jz .Lylrgb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lylrgb:
+ test $4,%edx
+ jz .Lylrgc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lylrgc:
+ test $8,%edx
+ jz .Lylrgd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lylrgd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in eax, y in ebx.
+# returns result in eax
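+#
+# In C terms this is roughly (ax = |x| bits, uy = y bits; a sketch only):
+#   if (ax == 0x3f800000) return 0x3f800000;               /* |x| == 1 -> 1.0f       */
+#   if ((uy & 0x80000000) == 0)                            /* y = +inf or huge +y    */
+#       return (ax > 0x3f800000) ? 0x7f800000 : 0;         /* inf if |x| > 1, else 0 */
+#   else                                                   /* y = -inf or huge -y    */
+#       return (ax < 0x3f800000) ? 0x7f800000 : 0;         /* inf if |x| < 1, else 0 */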
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+ cmovg %ecx,%eax # return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
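+#
+# In C terms this is roughly (x, uy are the float bit patterns; a sketch only):
+#   if (!(x & 0x80000000))                           /* x = +inf                       */
+#       return (uy & 0x80000000) ? 0 : x;            /* +0 if y < 0, +inf if y > 0     */
+#   if (inty == 1)                                   /* x = -inf, y an odd integer     */
+#       return (uy & 0x80000000) ? 0x80000000 : x;   /* -0 if y < 0, -inf if y > 0     */
+#   return (uy & 0x80000000) ? 0 : (x & 0x7fffffff); /* +0 if y < 0, +inf if y > 0     */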
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx # if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # eax = |x| (+inf), returned if y >= 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return |x| if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
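+#
+# In C terms this is roughly (x, uy are the float bit patterns; a sketch only):
+#   unsigned inf_or_zero = (uy & 0x80000000) ? 0x7f800000 : 0;   /* inf if y < 0, else 0 */
+#   if (inty == 1) return (x & 0x80000000) | inf_or_zero;        /* keep the sign of x   */
+#   else           return inf_or_zero;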
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx # if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
+
+
diff --git a/src/gas/vrsapowxf.S b/src/gas/vrsapowxf.S
new file mode 100644
index 0000000..4f67daf
--- /dev/null
+++ b/src/gas/vrsapowxf.S
@@ -0,0 +1,753 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsapowxf.s
+#
+# An array implementation of the powf libm function.
+# This routine raises the x array to a constant y power.
+#
+# Prototype:
+#
+# void vrsa_powxf(int n, float *x, float y, float *z);
+#
+# Places the results into the supplied z array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
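+#
+# A minimal C usage sketch (illustrative only; the values are hypothetical):
+#
+#   float x[8], z[8];
+#   /* ... fill x ... */
+#   vrsa_powxf(8, x, 2.5f, z);    /* z[i] = powf(x[i], 2.5f) for i = 0..7 */
+#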
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ p_xexp,0x20 # qword
+
+.equ save_rbx,0x030 #qword
+
+.equ p_y,0x048 # y value
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicator
+
+.equ p_xptr,0x0a0 # ptr to x values
+.equ p_zptr,0x0b0 # ptr to z values
+
+.equ p_nv,0x0b8 #qword
+.equ p_iter,0x0c0 # qword storage for number of loop iterations
+
+.equ p2_temp,0x0d0 #qword
+.equ p2_temp1,0x0f0 #qword
+
+.equ stack_size,0x0118 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+
+ .weak vrsa_powxf_
+ .set vrsa_powxf_,__vrsa_powxf__
+ .weak vrsa_powxf__
+ .set vrsa_powxf__,__vrsa_powxf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrsa_powxf__
+ .type __vrsa_powxf__,@function
+__vrsa_powxf__:
+
+#/* a FORTRAN subroutine implementation of array powf
+#** VRSA_POWXF(N,X,Y,Z)
+#** C equivalent
+#*/
+#void vrsa_powxf_(int * n, float *x, float *y, float *z)
+#{
+# vrsa_powxf(*n,x,y,z);
+#}
+# parameters are passed in by Linux FORTRAN as:
+# edi - int n
+# rsi - float *x
+# rdx - float *y
+# rcx - float *z
+ mov (%rdi),%edi
+ movss (%rdx),%xmm0
+ mov %rcx,%rdx
+
+
+
+
+# parameters are passed in by Linux C as:
+# edi - int n
+# rsi - float *x
+# xmm0 - float y
+# rdx - float *z
+
+.globl vrsa_powxf
+ .type vrsa_powxf,@function
+vrsa_powxf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ movss %xmm0,p_y(%rsp) # save y
+ mov %rsi,p_xptr(%rsp) # save pointer to x
+ mov %rdx,p_zptr(%rsp) # save pointer to z
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+#endif
+ test %rax,%rax # just return if count is zero
+ jz .L__final_check # exit if count is zero
+
+ mov %rax,%rcx
+ mov %rcx,p_nv(%rsp) # save number of values
+
+#
+# classify y
+# vector 32 bit integer method
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
+# movdqa .LXMMWORD(%rip),%xmm4 PTR [rdx]
+# get yexp
+ mov p_y(%rsp),%r8d # r8 is uy
+ mov $0x07fffffff,%r9d
+ and %r8d,%r9d # r9 is ay
+
+## if |y| == 0 then return 1
+ cmp $0,%r9d # is y a zero?
+ jz .Ly_zero
+
+ mov $0x07f800000,%eax # EXPBITS_SP32
+ and %r9d,%eax # y exp
+
+ xor %edi,%edi
+ shr $23,%eax #>> EXPSHIFTBITS_SP32
+ sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent
+ mov $1,%ebx
+ cmp %ebx,%eax # if (yexp < 1)
+ cmovl %edi,%ebx
+ jl .Lsave_inty
+
+ mov $24,%ecx
+ cmp %ecx,%eax # if (yexp >24)
+ jle .Lcly1
+ mov $2,%ebx
+ jmp .Lsave_inty
+.Lcly1: # else 1<=yexp<=24
+ sub %eax,%ecx # build mask for mantissa
+ shl %cl,%ebx
+ dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1
+
+ mov %r8d,%eax
+ and %ebx,%eax # if ((uy & mask) != 0)
+ cmovnz %edi,%ebx # inty = 0;
+ jnz .Lsave_inty
+
+ not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001)
+ mov %r8d,%eax
+ and %ebx,%eax
+ shr %cl,%eax
+ inc %edi
+ and %edi,%eax
+ mov %edi,%ebx # inty = 1
+ jnz .Lsave_inty
+ inc %ebx # else inty = 2
+
+
+.Lsave_inty:
+ mov %r8d,p_y+4(%rsp) # save an extra copy of y
+ mov %ebx,p_inty(%rsp) # save inty
+
+ mov p_nv(%rsp),%rax # get number of values
+ mov %rax,%rcx
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rcx # compute number of extra single calls
+ mov %rcx,p_nv(%rsp) # save number of left over values
+
+# process the array 4 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+# first get x
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ prefetch 64(%rsi)
+
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# do x special case checking
+#
+# movdqa %xmm4,%xmm5
+# pcmpeqd %xmm3,%xmm5 ; is y not an integer? ff's if so
+# pand .LXMMWORD(%rip),%xmm5 PTR __mask_NaN ; these values will be NaNs, if x<0
+ pxor %xmm3,%xmm3
+ xor %eax,%eax
+ mov $0x07FC00000,%ecx
+ cmp $0,%ebx # is y not an integer?
+ cmovz %ecx,%eax # then set to return a NaN. else 0.
+ mov $0x080000000,%ecx
+ cmp $1,%ebx # is y an odd integer?
+ cmovz %ecx,%eax # maybe set sign bit if so
+ movd %eax,%xmm5
+ pshufd $0,%xmm5,%xmm5
+# shufps xmm5,%xmm5
+# movdqa %xmm4,%xmm2
+# pcmpeqd .LXMMWORD(%rip),%xmm2 PTR __mask_1 ; is it odd? ff's if so
+# pand .LXMMWORD(%rip),%xmm2 PTR __mask_sign ; these values might get their sign bit set
+# por %xmm2,%xmm5
+
+# cmpps xmm3,XMMWORD PTR p_sx[rsp],0 ; if the signs are set
+ pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+ cvtps2pd p_y(%rsp),%xmm2 #convert the two packed single y's to double
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm2,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+
+#
+# convert all four results back to single
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ mov p_y(%rsp),%edx # get y
+ and $0x07fffffff,%edx # develop ay
+# mov $0x04f000000,%eax
+ cmp $0x04f000000,%edx
+ ja .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movss p_y(%rsp),%xmm4 # get y
+ ucomiss %xmm4,%xmm4 # comparing y to itself should
+ # be true, unless y is a NaN. parity flag if NaN.
+ jp .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqa p_ax(%rsp),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if x == +1, return +1 for all x
+ movdqa .L__float_one(%rip),%xmm3 # one
+ mov p_xptr(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ movdqu (%rdx), %xmm5
+ cmpps $4,%xmm5,%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__vsa_bottom:
+
+# update the x and y pointers
+ add $16,%rsi
+ mov %rsi,p_xptr(%rsp) # save x_array pointer
+# store the result _m128d
+ mov p_zptr(%rsp),%rdi # get z_array pointer
+ movups %xmm0,(%rdi)
+# prefetchw QWORD PTR [rdi+64]
+ prefetch 64(%rdi)
+ add $16,%rdi
+ mov %rdi,p_zptr(%rsp) # save z_array pointer
+
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+.L__final_check:
+
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# we jump here when there are one to three leftover values to process at the
+# end
+.L__vsa_cleanup:
+ mov p_nv(%rsp),%rax # get number of values
+
+ mov p_xptr(%rsp),%rsi
+ mov p_y(%rsp),%r8d # r8 is uy
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ mov %r8d,p2_temp+16(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ mov %r8d,p2_temp+20(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ mov %r8d,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ movaps p2_temp+16(%rsp),%xmm0 # y parameter
+ lea p2_temp1(%rsp),%rdx # &z parameter
+ call vrsa_powxf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov p_zptr(%rsp),%rdi
+ mov p_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+ .align 16
+.Ly_zero:
+## if |y| == 0 then return 1
+ mov $0x03f800000,%ecx # one
+# fill all results with a one
+ mov p_zptr(%rsp),%r9 # &z parameter
+ mov p_nv(%rsp),%rax # get number of values
+.L__yzt:
+ mov %ecx,(%r9) # store a 1
+ add $4,%r9
+ sub $1,%rax
+ test %rax,%rax
+ jnz .L__yzt
+ jmp .L__final_check
+# y is a NaN.
+.Ly_NaN:
+ mov p_y(%rsp),%r8d
+ or $0x000400000,%r8d # convert to QNaNs
+ movd %r8d,%xmm0 # propagate to all results
+ shufps $0,%xmm0,%xmm0
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqu (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 4(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 8(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 12(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in eax, y in ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+ cmovg %ecx,%eax # return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx # if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # eax = |x| (+inf), returned if y >= 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return |x| if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov (%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx # if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_ninf: .quad 0x0ff800000fF800000 # -infinity
+ .quad 0x0ff800000fF800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_impbit: .quad 0x00080000000800000 # implicit bit
+ .quad 0x00080000000800000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
+
+
+
diff --git a/src/gas/vrsasincosf.S b/src/gas/vrsasincosf.S
new file mode 100644
index 0000000..2bb70bf
--- /dev/null
+++ b/src/gas/vrsasincosf.S
@@ -0,0 +1,2008 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasincosf.s
+#
+# A vector implementation of the sincos libm function.
+#
+# Prototype:
+#
+# void vrsa_sincosf(int n, float *x, float *ys, float *yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine Cosine values at a time,
+# reading the inputs from the supplied x array.
+# The four Sine results are returned as packed singles in the supplied ys array.
+# The four Cosine results are returned as packed singles in the supplied yc array.
+# Note that returning both a sine and a cosine result is a non-standard interface,
+# as no ABI (and indeed C) currently allows returning 2 values from a function;
+# the array form therefore places its inputs and results in memory. It is expected
+# that some compilers may be able to take advantage of this interface when
+# implementing vectorized loops.
+
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
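+
+# A minimal C usage sketch (illustrative only; the array contents are hypothetical):
+#
+#   float x[8], s[8], c[8];
+#   /* ... fill x ... */
+#   vrsa_sincosf(8, x, s, c);    /* s[i] = sinf(x[i]), c[i] = cosf(x[i]) */
+#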
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 8
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_sincosf_
+ .set vrsa_sincosf_,__vrsa_sincosf__
+ .weak vrsa_sincosf__
+ .set vrsa_sincosf__,__vrsa_sincosf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array sincos
+#VRSA_SINCOSF(N,X,Y,Z)
+#C equivalent
+#void vrsa_sincosf__(int * n, float *x, float *y, float *z)
+#{
+# vrsa_sincosf(*n,x,y,z);
+#}
+
+.globl __vrsa_sincosf__
+ .type __vrsa_sincosf__,@function
+__vrsa_sincosf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to rr for remainder_piby2
+.equ region,0x0E0 # pointer to region for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r1 for remainder_piby2
+.equ rr1,0x0100 # pointer to rr1 for remainder_piby2
+.equ region1,0x0110 # pointer to region1 for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign_sin,0x0180 # Sign of lower sin term
+
+.equ p_original1,0x0190 # original x (upper pair)
+.equ p_mask1,0x01A0 # mask (upper pair)
+.equ p_sign1_sin,0x01B0 # Sign of upper sin term
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+.equ p_sin,0x01E0 # sin
+.equ p_cos,0x01F0 # cos
+
+.equ save_rdi,0x0200 # save area for rdi
+.equ save_rsi,0x0210 # save area for rsi
+
+.equ p_sign_cos,0x0220 # Sign of lower cos term
+.equ p_sign1_cos,0x0230 # Sign of upper cos term
+
+.equ save_xa,0x0240 #qword ; leave space for 4 args*****
+.equ save_ysa,0x0250 #qword ; leave space for 4 args*****
+.equ save_yca,0x0260 #qword ; leave space for 4 args*****
+
+.equ save_nv,0x0270 #qword
+.equ p_iter,0x0280 #qword storage for number of loop iterations
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sincosf
+ .type vrsa_sincosf,@function
+vrsa_sincosf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *ys
+# rcx - float *yc
+
+ sub $0x0298,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ysa(%rsp) # save ysin_array pointer
+ mov %rcx,save_yca(%rsp) # save ycos_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+# movapd .L__real_7fffffffffffffff,%xmm2 #
+# mov .L__real_7fffffffffffffff,%rdx #
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
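+#
+# The full reduction, in scalar C terms (a sketch matching the comments below):
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;  rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;        /* extra-precision remainder of x mod pi/2 */
+#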
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+#DELETE
+# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path
+#DELETE
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi # npi2 in int
+ mov %r11,%rsi # npi2 in int
+ #ADDED
+
+ shr $1,%r10 # 0 and 1 => 0
+ shr $1,%r11 # 2 and 3 => 1
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi # xor last 2 bits of region for cos
+ xor %r11,%rsi # xor last 2 bits of region for cos
+ #ADDED
+
+ not %r12 #~(sign)
+ not %r13 #~(sign)
+ and %r12,%r10 #region & ~(sign)
+ and %r13,%r11 #region & ~(sign)
+
+ not %rax #~(region)
+ not %rcx #~(region)
+ not %r12 #~~(sign)
+ not %r13 #~~(sign)
+ and %r12,%rax #~region & ~~(sign)
+ and %r13,%rcx #~region & ~~(sign)
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi # sign for cos
+ and .L__reald_one_one(%rip),%rsi # sign for cos
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 # sign for sin
+ and .L__reald_one_one(%rip),%r11 # sign for sin
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+
+# NEW
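+
+# The "NEW"/"ADDED" block above derives per-lane sign masks for the final sin
+# and cos results. Roughly, in C-like pseudocode (illustrative names only):
+#
+#   sin_sign = ((npi2 >> 1) ^ signbit(x)) & 1;   /* sin is odd in x         */
+#   cos_sign = ((npi2 >> 1) ^  npi2)      & 1;   /* depends on region only  */
+#
+# Each bit is then shifted into the sign-bit position of its double lane and
+# stored in p_sign_sin/p_sign_cos (and the p_sign1_* copies for the second
+# pair of inputs), to be XORed onto the polynomial results in the cleanup.
+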
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+# subpd %xmm10,%xmm6 ;rr=rhead-r
+# subpd %xmm1,%xmm7 ;rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail
+# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail
+
+ and .L__reald_zero_one(%rip),%rax # region for jump table
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
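+
+# The instructions above pack the odd/even flags (bit 0 of npi2) of the four
+# lanes into a 4-bit index used below to pick one of the 16 fixup stubs in
+# .Levensin_oddcos_tbl. Roughly:
+#
+#   idx = (npi2_0 & 1) | ((npi2_1 & 1) << 1)
+#       | ((npi2_2 & 1) << 2) | ((npi2_3 & 1) << 3);
+#
+# A set bit means that lane landed in an odd region, so its sin result must
+# be taken from the cos polynomial evaluated below (and vice versa).
+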
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
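+
+# A C-like sketch of the core polynomial evaluation above, per double lane,
+# with c1..c4 from .Lcosarray and s1..s4 from .Lsinarray (names illustrative):
+#
+#   r2 = r * r;   r3 = r2 * r;   r4 = r2 * r2;
+#   zc = (c1 + c2*r2) + r4*(c3 + c4*r2);
+#   zs = (s1 + s2*r2) + r4*(s3 + s4*r2);
+#   cos(r) ~= 1.0 - 0.5*r2 + r4*zc;
+#   sin(r) ~= r + r3*zs;
+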
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0					# xmm0 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
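+
+# Arguments >= 5e5 are reduced by the out-of-line __remainder_piby2d2f helper;
+# NaN/Inf inputs are filtered first so the call can be skipped. Per scalar
+# argument the logic above is roughly (C-like sketch):
+#
+#   if ((bits(x) & 0x7ff0000000000000) == 0x7ff0000000000000) {  /* NaN/Inf */
+#       r      = bits(x) | 0x0008000000000000;  /* quiet the NaN / Inf->NaN */
+#       region = 0;
+#   } else {
+#       /* argument bits in %rdi, &r in %rsi, &region in %rdx */
+#       __remainder_piby2d2f(...);
+#   }
+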
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax				# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0					# xmm0 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)				# store lower region
+
+# movsd %xmm6,%xmm10
+# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm10,%xmm6 ; rr=rhead-r
+# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm0,%xmm6					# xmm6 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r[rsp], xmm10 ; store upper r
+# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr
+
+	movlpd	%xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi
+ mov %r11,%rsi
+ #ADDED
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi
+ xor %r11,%rsi
+ #ADDED
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+#NEW
+
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_sincosf_cleanup:
+
+ movapd p_sign_cos(%rsp),%xmm10
+ movapd p_sign1_cos(%rsp),%xmm1
+ xorpd %xmm4,%xmm10 # Cos term (+) Sign
+ xorpd %xmm5,%xmm1 # Cos term (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+
+ movapd p_sign_sin(%rsp),%xmm14
+ movapd p_sign1_sin(%rsp),%xmm15
+ xorpd %xmm6,%xmm14 # Sin term (+) Sign
+ xorpd %xmm7,%xmm15 # Sin term (+) Sign
+
+ cvtpd2ps %xmm14,%xmm12
+ cvtpd2ps %xmm15,%xmm13
+
+
+.L__vrsa_bottom1:
+# store the result _m128 (packed singles)
+
+ mov save_ysa(%rsp),%r8
+ mov save_yca(%rsp),%r9
+
+ movlps %xmm0, (%r9) # save the cos
+ movlps %xmm12, (%r8) # save the sin
+ movlps %xmm11, 8(%r9) # save the cos
+ movlps %xmm13, 8(%r8) # save the sin
+
+
+ prefetch 32(%r8)
+ prefetch 32(%r9)
+
+ add $16,%r8
+ add $16,%r9
+
+ mov %r8,save_ysa(%rsp) # save y_sinarray pointer
+ mov %r9,save_yca(%rsp) # save y_cosarray pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0298,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when fewer than four values remain to be processed at the end.
+# The x, sin-result and cos-result pointers are reloaded below from save_xa,
+# save_ysa and save_yca; the number of values left is in save_nv.
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%r12
+
+
+# fill an _m128 with zeroes and the leftover values, then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &ys parameter
+ lea p_temp3(%rsp),%rcx # &yc parameter
+ call vrsa_sincosf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%r12
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ mov p_temp3(%rsp),%edx
+ mov %edx,(%r12) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ mov p_temp3+4(%rsp),%edx
+ mov %edx,4(%r12) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+ mov p_temp3+8(%rsp),%edx
+ mov %edx,8(%r12) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
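+
+# The cleanup path above handles the 1-3 leftover elements when n is not a
+# multiple of 4: it zero-pads a 4-element scratch buffer, calls vrsa_sincosf
+# recursively on it, and copies back only the valid results. Roughly:
+#
+#   float tx[4] = {0}, ts[4], tc[4];
+#   for (i = 0; i < nleft; i++) tx[i] = x[i];
+#   vrsa_sincosf(4, tx, ts, tc);
+#   for (i = 0; i < nleft; i++) { ys[i] = ts[i]; yc[i] = tc[i]; }
+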
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Lcoscos_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower and Upper odd, so swap both pairs
+
+ movapd %xmm4,%xmm8
+ movapd %xmm5,%xmm9
+
+ movapd %xmm6,%xmm4
+ movapd %xmm7,%xmm5
+
+ movapd %xmm8,%xmm6
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_cossin_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_sinsin_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower even, Upper odd, Swap upper
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower odd, Upper even, Swap lower
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrsa_sincosf_cleanup
+
+
+.align 16
+.Lsincos_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm5
+ movsd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm4
+ movsd %xmm8,%xmm6
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper even, nothing to swap
+
+ jmp .L__vrsa_sincosf_cleanup
diff --git a/src/gas/vrsasinf.S b/src/gas/vrsasinf.S
new file mode 100644
index 0000000..6cbff59
--- /dev/null
+++ b/src/gas/vrsasinf.S
@@ -0,0 +1,2441 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasinf.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# vrsa_sinf(int n, float* x, float* y);
+#
+# Computes Sine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This inlines a routine that computes 4 single precision Sine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
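+#
+# A minimal C usage sketch (declaration taken from the prototype above):
+#
+#   extern void vrsa_sinf(int n, float *x, float *y);
+#
+#   float x[8] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f};
+#   float y[8];
+#   vrsa_sinf(8, x, y);    /* y[i] = sinf(x[i]) */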
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 8
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
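+
+# Index bit i (i = 0..3) is set when input i fell in an odd region (npi2 odd).
+# The selected stub assembles the final result using, per lane, roughly:
+#
+#   result[i] = (npi2_i & 1) ? cos_poly(r_i) : sin_poly(r_i);
+#
+# with the sign applied separately from the p_sign/p_sign1 masks.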
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_sinf_
+ .set vrsa_sinf_,__vrsa_sinf__
+ .weak vrsa_sinf__
+ .set vrsa_sinf__,__vrsa_sinf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array sin
+#VRSA_SINF(N,X,Y)
+#C equivalent
+#void vrsa_sinf__(int *n, float *x, float *y)
+#{
+# vrsa_sinf(*n,x,y);
+#}
+
+.globl __vrsa_sinf__
+ .type __vrsa_sinf__,@function
+__vrsa_sinf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # original x
+.equ p_sign,0x0180 # original x
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # original x
+.equ p_sign1,0x01B0 # original x
+
+.equ save_r12,0x01C0 # temporary for get/put bits operation
+.equ save_r13,0x01D0 # temporary for get/put bits operation
+
+.equ save_xa,0x01E0 #qword
+.equ save_ya,0x01F0 #qword
+
+.equ save_nv,0x0200 #qword
+.equ p_iter,0x0210 # qword storage for number of loop iterations
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sinf
+ .type vrsa_sinf,@function
+vrsa_sinf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux (System V AMD64 ABI) as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+ sub $0x0228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
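+
+# Loop bookkeeping above, roughly:
+#
+#   iter  = n >> 2;            /* number of 4-wide groups            */
+#   if (iter == 0) goto cleanup;
+#   nleft = n - (iter << 2);   /* 0..3 values handled in the cleanup */
+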
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# V4 START
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
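+# Cody-Waite style reduction: pi/2 is split into piby2_1 + piby2_2 + piby2_2tail
+# so that npi2*piby2_1 is exact and rhead/rtail keep an extra-precision remainder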
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0                              # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
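+# __remainder_piby2d2f is called with the argument bit pattern in rdi and the
+# addresses of the reduced value (rsi) and the region/quadrant word (rdx)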
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf:
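+# argument is NaN/Inf: set the quiet-NaN bit (0x0008000000000000) in the raw
+# bits so the eventual result is a quiet NaN, and force the region to 0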
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov .LQWORD,%rax PTR p_original[rsp]
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0                              # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r=(rhead-rtail)
+
+	movlpd	 %xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
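+# as above, rax is a 4-bit even/odd-region index selecting one of the 16
+# sin/cos combination routines reached through .Levensin_oddcos_tbl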
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_sinf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+# NEW
+
+.L__vrsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlps %xmm0,(%rdi)
+ movhps %xmm0,8(%rdi)
+
+ prefetch 32(%rdi)
+ add $16,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+# NEW
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0228,%rsp
+ ret
+
+#NEW
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have an odd number of cos calls to make at the end
+# we assume that rdx is pointing at the next x array element, r8 at the next y array element.
+# The number of values left is in save_nv
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+
+# START WORKING FROM HERE
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrsa_sinf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
+
+#NEW
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+	movhlps	%xmm5,%xmm9				# xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+	movhlps	%xmm4,%xmm12				# xmm12 = sin , xmm4 = cos
+	movhlps	%xmm5,%xmm13				# xmm13 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_sinf_cleanup
diff --git a/src/hypot.c b/src/hypot.c
new file mode 100644
index 0000000..063d526
--- /dev/null
+++ b/src/hypot.c
@@ -0,0 +1,223 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SCALEDOUBLE_1
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SCALEDOUBLE_1
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange_overflow(double x, double y)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = y;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"hypot";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+double FN_PROTOTYPE(hypot)(double x, double y)
+#else
+double FN_PROTOTYPE(hypot)(double x, double y)
+#endif
+{
+ /* Returns sqrt(x*x + y*y) with no overflow or underflow unless
+ the result warrants it */
+
+ const double large = 1.79769313486231570815e+308; /* 0x7fefffffffffffff */
+
+ double u, r, retval, hx, tx, x2, hy, ty, y2, hs, ts;
+ unsigned long long xexp, yexp, ux, uy, ut;
+ int dexp, expadjust;
+
+ GET_BITS_DP64(x, ux);
+ ux &= ~SIGNBIT_DP64;
+ GET_BITS_DP64(y, uy);
+ uy &= ~SIGNBIT_DP64;
+ xexp = (ux >> EXPSHIFTBITS_DP64);
+ yexp = (uy >> EXPSHIFTBITS_DP64);
+
+ if (xexp == BIASEDEMAX_DP64 + 1 || yexp == BIASEDEMAX_DP64 + 1)
+ {
+ /* One or both of the arguments are NaN or infinity. The
+ result will also be NaN or infinity. */
+ retval = x*x + y*y;
+ if (((xexp == BIASEDEMAX_DP64 + 1) && !(ux & MANTBITS_DP64)) ||
+ ((yexp == BIASEDEMAX_DP64 + 1) && !(uy & MANTBITS_DP64)))
+ /* x or y is infinity. ISO C99 defines that we must
+ return +infinity, even if the other argument is NaN.
+ Note that the computation of x*x + y*y above will already
+ have raised invalid if either x or y is a signalling NaN. */
+ return infinity_with_flags(0);
+ else
+ /* One or both of x or y is NaN, and neither is infinity.
+ Raise invalid if it's a signalling NaN */
+ return retval;
+ }
+
+ /* Set x = abs(x) and y = abs(y) */
+ PUT_BITS_DP64(ux, x);
+ PUT_BITS_DP64(uy, y);
+
+ /* The difference in exponents between x and y */
+ dexp = (int)(xexp - yexp);
+ expadjust = 0;
+
+ if (ux == 0)
+ /* x is zero */
+ return y;
+ else if (uy == 0)
+ /* y is zero */
+ return x;
+ else if (dexp > MANTLENGTH_DP64 + 1 || dexp < -MANTLENGTH_DP64 - 1)
+ /* One of x and y is insignificant compared to the other */
+ return x + y; /* Raise inexact */
+ else if (xexp > EXPBIAS_DP64 + 500 || yexp > EXPBIAS_DP64 + 500)
+ {
+ /* Danger of overflow; scale down by 2**600. */
+ expadjust = 600;
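+      /* 0x2580000000000000 is 600 << 52, so subtracting it from the bit
+         pattern multiplies the value by 2**(-600) */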
+ ux -= 0x2580000000000000;
+ PUT_BITS_DP64(ux, x);
+ uy -= 0x2580000000000000;
+ PUT_BITS_DP64(uy, y);
+ }
+ else if (xexp < EXPBIAS_DP64 - 500 || yexp < EXPBIAS_DP64 - 500)
+ {
+ /* Danger of underflow; scale up by 2**600. */
+ expadjust = -600;
+ if (xexp == 0)
+ {
+ /* x is denormal - handle by adding 601 to the exponent
+ and then subtracting a correction for the implicit bit */
+ PUT_BITS_DP64(ux + 0x2590000000000000, x);
+ x -= 9.23297861778573578076e-128; /* 0x2590000000000000 */
+ GET_BITS_DP64(x, ux);
+ }
+ else
+ {
+ /* x is normal - just increase the exponent by 600 */
+ ux += 0x2580000000000000;
+ PUT_BITS_DP64(ux, x);
+ }
+ if (yexp == 0)
+ {
+ PUT_BITS_DP64(uy + 0x2590000000000000, y);
+ y -= 9.23297861778573578076e-128; /* 0x2590000000000000 */
+ GET_BITS_DP64(y, uy);
+ }
+ else
+ {
+ uy += 0x2580000000000000;
+ PUT_BITS_DP64(uy, y);
+ }
+ }
+
+
+#ifdef FAST_BUT_GREATER_THAN_ONE_ULP
+ /* Not awful, but results in accuracy loss larger than 1 ulp */
+  r = x*x + y*y;
+#else
+ /* Slower but more accurate */
+
+ /* Sort so that x is greater than y */
+ if (x < y)
+ {
+ u = y;
+ y = x;
+ x = u;
+ ut = ux;
+ ux = uy;
+ uy = ut;
+ }
+
+ /* Split x into hx and tx, head and tail */
+ PUT_BITS_DP64(ux & 0xfffffffff8000000, hx);
+ tx = x - hx;
+
+ PUT_BITS_DP64(uy & 0xfffffffff8000000, hy);
+ ty = y - hy;
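+  /* The masking keeps the sign, exponent and top 26 significant bits, so
+     hx*hx, hy*hy and hx*hy are exactly representable doubles */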
+
+ /* Compute r = x*x + y*y with extra precision */
+ x2 = x*x;
+ y2 = y*y;
+ hs = x2 + y2;
+
+ if (dexp == 0)
+ /* We take most care when x and y have equal exponents,
+ i.e. are almost the same size */
+ ts = (((x2 - hs) + y2) +
+ ((hx * hx - x2) + 2 * hx * tx) + tx * tx) +
+ ((hy * hy - y2) + 2 * hy * ty) + ty * ty;
+ else
+ ts = (((x2 - hs) + y2) +
+ ((hx * hx - x2) + 2 * hx * tx) + tx * tx);
+
+ r = hs + ts;
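+  /* hs is the rounded sum x2 + y2 and ts collects the rounding errors of the
+     squarings and the addition, so hs + ts is x*x + y*y to nearly full precision */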
+#endif
+
+ /* The sqrt can introduce another half ulp error. */
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (r));
+#endif
+
+ /* If necessary scale the result back. This may lead to
+ overflow but if so that's the correct result. */
+ retval = scaleDouble_1(retval, expadjust);
+
+ if (retval > large)
+ /* The result overflowed. Deal with errno. */
+#ifdef WINDOWS
+ return handle_error("hypot", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y);
+#else
+ return retval_errno_erange_overflow(x, y);
+#endif
+
+ return retval;
+}
+
+weak_alias (__hypot, hypot)
diff --git a/src/hypotf.c b/src/hypotf.c
new file mode 100644
index 0000000..fcc09fc
--- /dev/null
+++ b/src/hypotf.c
@@ -0,0 +1,131 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef USE_SOFTWARE_SQRT
+#define USE_SQRTF_AMD_INLINE
+#endif
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#ifdef USE_SOFTWARE_SQRT
+#undef USE_SQRTF_AMD_INLINE
+#endif
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange_overflow(float x, float y)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)y;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"hypotf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = infinityf_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+float FN_PROTOTYPE(hypotf)(float x, float y)
+{
+ /* Returns sqrt(x*x + y*y) with no overflow or underflow unless
+ the result warrants it */
+
+ /* Do intermediate computations in double precision
+ and use sqrt instruction from chip if available. */
+ double dx = x, dy = y, dr, retval;
+
+ /* The largest finite float, stored as a double */
+ const double large = 3.40282346638528859812e+38; /* 0x47efffffe0000000 */
+
+
+ unsigned long long ux, uy, avx, avy;
+
+ GET_BITS_DP64(x, avx);
+ avx &= ~SIGNBIT_DP64;
+ GET_BITS_DP64(y, avy);
+ avy &= ~SIGNBIT_DP64;
+ ux = (avx >> EXPSHIFTBITS_DP64);
+ uy = (avy >> EXPSHIFTBITS_DP64);
+
+ if (ux == BIASEDEMAX_DP64 + 1 || uy == BIASEDEMAX_DP64 + 1)
+ {
+ retval = x*x + y*y;
+ /* One or both of the arguments are NaN or infinity. The
+ result will also be NaN or infinity. */
+ if (((ux == BIASEDEMAX_DP64 + 1) && !(avx & MANTBITS_DP64)) ||
+ ((uy == BIASEDEMAX_DP64 + 1) && !(avy & MANTBITS_DP64)))
+ /* x or y is infinity. ISO C99 defines that we must
+ return +infinity, even if the other argument is NaN.
+ Note that the computation of x*x + y*y above will already
+ have raised invalid if either x or y is a signalling NaN. */
+ return infinityf_with_flags(0);
+ else
+ /* One or both of x or y is NaN, and neither is infinity.
+ Raise invalid if it's a signalling NaN */
+ return (float)retval;
+ }
+
+ dr = (dx*dx + dy*dy);
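+ /* dx and dy are doubles converted from floats, so dx*dx + dy*dy can
+ neither overflow nor underflow in double precision; no argument
+ scaling is needed here, unlike in the double-precision hypot. */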
+
+#ifdef USE_SOFTWARE_SQRT
+ retval = sqrtf_amd_inline(dr);
+#else
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&dr)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (dr));
+#endif
+#endif
+
+ if (retval > large)
+#ifdef WINDOWS
+ return handle_errorf("hypotf", PINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y);
+#else
+ return retval_errno_erange_overflow(x, y);
+#endif
+ else
+ return (float)retval;
+ }
+
+weak_alias (__hypotf, hypotf)
diff --git a/src/ilogb.c b/src/ilogb.c
new file mode 100644
index 0000000..2c1cb7c
--- /dev/null
+++ b/src/ilogb.c
@@ -0,0 +1,99 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include <limits.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+int FN_PROTOTYPE(ilogb)(double x)
+{
+
+
+ /* Check for input range */
+ UT64 checkbits;
+ int expbits;
+ U64 manbits;
+ U64 zerovalue;
+ /* Clear the sign bit and check if the value is zero, NaN or Inf. */
+ checkbits.f64=x;
+ zerovalue = (checkbits.u64 & ~SIGNBIT_DP64);
+
+ if(zerovalue == 0)
+ {
+ /* Raise exception as the number is zero */
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN);
+
+
+ return INT_MIN;
+ }
+
+ if( zerovalue == EXPBITS_DP64 )
+ {
+ /* Raise exception as the number is inf */
+
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MAX);
+
+ return INT_MAX;
+ }
+
+ if( zerovalue > EXPBITS_DP64 )
+ {
+ /* Raise exception as the number is nan */
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN);
+
+
+ return INT_MIN;
+ }
+
+ expbits = (int) (( checkbits.u64 << 1) >> 53);
+
+ if(expbits == 0 && (checkbits.u64 & MANTBITS_DP64 )!= 0)
+ {
+ /* the value is denormalized */
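+ /* Shift the mantissa left until the implicit-bit position
+ (IMPBIT_DP64) is reached, decrementing the exponent from
+ EMIN_DP64 for each shift; the final value of expbits is then the
+ exponent of the leading set bit of the denormal. */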
+ manbits = checkbits.u64 & MANTBITS_DP64;
+ expbits = EMIN_DP64;
+ while (manbits < IMPBIT_DP64)
+ {
+ manbits <<= 1;
+ expbits--;
+ }
+ }
+ else
+ {
+
+ expbits-=EXPBIAS_DP64;
+ }
+
+
+ return expbits;
+}
diff --git a/src/ilogbf.c b/src/ilogbf.c
new file mode 100644
index 0000000..cb129e6
--- /dev/null
+++ b/src/ilogbf.c
@@ -0,0 +1,109 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include <limits.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+int FN_PROTOTYPE(ilogbf)(float x)
+{
+
+ /* Check for input range */
+ UT32 checkbits;
+ int expbits;
+ U32 manbits;
+ U32 zerovalue;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value is zero, NaN or Inf. */
+ zerovalue = (checkbits.u32 & ~SIGNBIT_SP32);
+
+ if(zerovalue == 0)
+ {
+ /* Raise exception as the number is zero */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0);
+ }
+
+ return INT_MIN;
+ }
+
+ if( zerovalue == EXPBITS_SP32 )
+ {
+ /* Raise exception as the number is inf */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MAX, 0);
+ }
+
+ return INT_MAX;
+ }
+
+ if( zerovalue > EXPBITS_SP32 )
+ {
+ /* Raise exception as the number is nan */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0);
+ }
+
+ return INT_MIN;
+ }
+
+ expbits = (int) (( checkbits.u32 << 1) >> 24);
+
+ if(expbits == 0 && (checkbits.u32 & MANTBITS_SP32 )!= 0)
+ {
+ /* the value is denormalized */
+ manbits = checkbits.u32 & MANTBITS_SP32;
+ expbits = EMIN_SP32;
+ while (manbits < IMPBIT_SP32)
+ {
+ manbits <<= 1;
+ expbits--;
+ }
+ }
+ else
+ {
+ expbits-=EXPBIAS_SP32;
+ }
+
+
+ return expbits;
+}
diff --git a/src/ldexp.c b/src/ldexp.c
new file mode 100644
index 0000000..695118b
--- /dev/null
+++ b/src/ldexp.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+double FN_PROTOTYPE(ldexp)(double x, int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
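+ /* VAL_2PMULTIPLIER_DP is presumably 2^MULTIPLIER_DP (2^53 per the
+ comment above); MULTIPLIER_DP is subtracted from the exponent below,
+ and the non-overflow path multiplies by VAL_2PMMULTIPLIER_DP
+ (presumably 2^-MULTIPLIER_DP) to undo the scaling. */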
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
+
+
diff --git a/src/ldexpf.c b/src/ldexpf.c
new file mode 100644
index 0000000..892c6e9
--- /dev/null
+++ b/src/ldexpf.c
@@ -0,0 +1,133 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+#include <math.h>
+#include <errno.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+float FN_PROTOTYPE(ldexpf)(float x, int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/libm_special.c b/src/libm_special.c
new file mode 100644
index 0000000..974d99b
--- /dev/null
+++ b/src/libm_special.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+#ifdef WIN64
+#define EXCEPTION_S _exception
+#else
+#define EXCEPTION_S exception
+#endif
+
+
+
+static double convert_snan_32to64(float x)
+{
+ U64 t;
+ UT32 xs;
+ UT64 xb;
+
+ xs.f32 = x;
+ xb.u64 = (((xs.u32 & SIGNBIT_SP32) == SIGNBIT_SP32) ? NINFBITPATT_DP64 : EXPBITS_DP64);
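+ /* Start from an infinity bit pattern of matching sign: its exponent
+ field is all ones and its quiet bit is clear, so OR-ing in the
+ shifted float payload below produces a double sNaN with the same
+ sign and payload, without performing an FP conversion that would
+ raise the invalid exception and quiet the NaN. */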
+
+ t = 0;
+ t = (xs.u32 & MANTBITS_SP32);
+ t = (t << 29); // 29 = (52-23)
+ xb.u64 = (xb.u64 | t);
+
+ return xb.f64;
+}
+
+#ifdef NEED_FAKE_MATHERR
+int
+matherr (struct exception *s)
+{
+ return 0;
+}
+#endif
+
+void __amd_handle_errorf(int type, int error, const char *name,
+ float arg1, unsigned int arg1_is_snan,
+ float arg2, unsigned int arg2_is_snan,
+ float retval, unsigned int retval_is_snan)
+{
+ struct EXCEPTION_S exception_data;
+
+ // write exception info
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+
+ // An sNaN float-to-double conversion can trigger an FP exception,
+ // so such arguments are converted specially.
+
+ if(arg1_is_snan) { exception_data.arg1 = convert_snan_32to64(arg1); }
+ else { exception_data.arg1 = (double)arg1; }
+
+ if(arg2_is_snan) { exception_data.arg2 = convert_snan_32to64(arg2); }
+ else { exception_data.arg2 = (double)arg2; }
+
+ if(retval_is_snan) { exception_data.retval = convert_snan_32to64(retval); }
+ else { exception_data.retval = (double)retval; }
+
+ // call matherr, set errno if matherr returns 0
+ if(!matherr(&exception_data))
+ {
+ errno = error;
+ }
+}
+
+void __amd_handle_error(int type, int error, const char *name,
+ double arg1,
+ double arg2,
+ double retval)
+{
+ struct EXCEPTION_S exception_data;
+
+ // write exception info
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+
+ exception_data.arg1 = arg1;
+ exception_data.arg2 = arg2;
+ exception_data.retval = retval;
+
+ // call matherr, set errno if matherr returns 0
+ if(!matherr(&exception_data))
+ {
+ errno = error;
+ }
+}
+
+#endif /* __x86_64__ */
diff --git a/src/llrint.c b/src/llrint.c
new file mode 100644
index 0000000..5f96115
--- /dev/null
+++ b/src/llrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+long long int FN_PROTOTYPE(llrint)(double x)
+{
+
+
+ UT64 checkbits,val_2p52;
+ checkbits.f64=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+ /* The number cannot be rounded: raise an exception. */
+ /* It exceeds the representable range and could also be NaN or Inf. */
+ __amd_handle_error(DOMAIN, EDOM, "llrint", x,0.0 ,(double)x);
+
+ return (long long int) x;
+ }
+
+ val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+
+ /* Add and sub 2^52 to round the number according to the current rounding direction */
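+ /* Example, with round-to-nearest and x = 2.7: doubles at magnitude
+ 2^52 are spaced 1.0 apart, so x + 2^52 rounds to 2^52 + 3, and
+ subtracting 2^52 leaves exactly 3.0. val_2p52 carries the sign of x,
+ so the same trick works for negative arguments. */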
+
+ return (long long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
diff --git a/src/llrintf.c b/src/llrintf.c
new file mode 100644
index 0000000..509e46b
--- /dev/null
+++ b/src/llrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long long int FN_PROTOTYPE(llrintf)(float x)
+{
+
+ UT32 checkbits,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+ /* The number cannot be rounded: raise an exception. */
+ /* It exceeds the representable range and could also be NaN or Inf. */
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llrintf", x, is_x_snan, 0.0F , 0,(float)x, 0);
+ }
+
+ return (long long int) x;
+ }
+
+
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+ /* Add and sub 2^23 to round the number according to the current rounding direction */
+
+ return (long long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/llround.c b/src/llround.c
new file mode 100644
index 0000000..0b582c2
--- /dev/null
+++ b/src/llround.c
@@ -0,0 +1,112 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/*In Windows long long int is 64-bit and long int is 32-bit.
+ In Linux both long long int and long int are 64-bit. */
+long long int FN_PROTOTYPE(llround)(double d)
+{
+ UT64 u64d;
+ UT64 u64Temp,u64result;
+ int intexp, shift;
+ U64 sign;
+ long long int result;
+
+ u64d.f64 = u64Temp.f64 = d;
+
+ if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000)
+ {
+ /* The number is NaN or infinity */
+ //Need to raise a range or domain error
+ __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0 , (double)SIGNBIT_DP64);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ }
+
+ u64Temp.u32[1] &= 0x7FFFFFFF;
+ intexp = (u64d.u32[1] & 0x7FF00000) >> 20;
+ sign = u64d.u64 & 0x8000000000000000;
+ intexp -= 0x3FF;
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+ /* 1.0 x 2^63 is already too large */
+ if (intexp >= 63)
+ {
+ /*The result is out of range: return LLONG_MIN*/
+ result = 0x8000000000000000; /*Return LLONG MIN*/
+ __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+ u64result.f64 = u64Temp.f64;
+ /* >= 2^52 is already an exact integer */
+ if (intexp < 52)
+ {
+ /* add 0.5, extraction below will truncate */
+ u64result.f64 = u64Temp.f64 + 0.5;
+ }
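+ /* Since the sign was cleared from u64Temp above and is reapplied at
+ the end, adding 0.5 and truncating effectively rounds halfway cases
+ away from zero, as llround requires. */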
+
+ intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF;
+
+ u64result.u32[1] &= 0xfffff;
+ u64result.u32[1] |= 0x00100000; /*Mask the last exp bit to 1*/
+ shift = intexp - 52;
+
+ if(shift < 0)
+ u64result.u64 = u64result.u64 >> (-shift);
+ if(shift > 0)
+ u64result.u64 = u64result.u64 << (shift);
+
+ result = u64result.u64;
+
+ if (sign)
+ result = -result;
+
+ return result;
+}
+
+#else //WINDOWS
+/*llround is equivalent to the Linux implementation of
+ lround. Both long int and long long int are of the same size*/
+long long int FN_PROTOTYPE(llround)(double d)
+{
+ long long int result;
+ result = FN_PROTOTYPE(lround)(d);
+ return result;
+}
+#endif
diff --git a/src/llroundf.c b/src/llroundf.c
new file mode 100644
index 0000000..0e1ac8a
--- /dev/null
+++ b/src/llroundf.c
@@ -0,0 +1,132 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/*In Windows long long int is 64-bit and long int is 32-bit.
+ In Linux both long long int and long int are 64-bit. */
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+ UT32 u32d;
+ UT32 u32Temp,u32result;
+ int intexp, shift;
+ U32 sign;
+ long long int result;
+
+ u32d.f32 = u32Temp.f32 = f;
+ if ((u32d.u32 & 0X7F800000) == 0x7F800000)
+ {
+ /* The number is NaN or infinity */
+ //Need to raise a range or domain error
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_DP64, 0);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ }
+
+ }
+
+ u32Temp.u32 &= 0x7FFFFFFF;
+ intexp = (u32d.u32 & 0x7F800000) >> 23;
+ sign = u32d.u32 & 0x80000000;
+ intexp -= 0x7F;
+
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+
+ /* 1.0 x 2^63 is already too large */
+ if (intexp >= 63)
+ {
+ result = 0x8000000000000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+
+ u32result.f32 = u32Temp.f32;
+
+ /* >= 2^23 is already an exact integer */
+ if (intexp < 23)
+ {
+ /* add 0.5, extraction below will truncate */
+ u32result.f32 = u32Temp.f32 + 0.5F;
+ }
+ intexp = (u32result.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7f;
+ u32result.u32 &= 0x7fffff;
+ u32result.u32 |= 0x00800000;
+
+ result = u32result.u32;
+
+ /*Since float is only 32 bits, for higher accuracy we first shift the result
+ * left by 32 bits. The next step then shifts an extra 32 bits in the reverse
+ * direction, based on the value of intexp*/
+ result = result << 32;
+ shift = intexp - 55; /*55= 23 +32*/
+
+
+ if(shift < 0)
+ result = result >> (-shift);
+ if(shift > 0)
+ result = result << (shift);
+
+ if (sign)
+ result = -result;
+ return result;
+
+}
+
+#else //WINDOWS
+/*llroundf is equivalent to the linux implementation of
+ lroundf. Both long int and long long int are of the same size*/
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+ long long int result;
+ result = FN_PROTOTYPE(lroundf)(f);
+ return result;
+
+}
+#endif
+
diff --git a/src/log1p.c b/src/log1p.c
new file mode 100644
index 0000000..b7cd097
--- /dev/null
+++ b/src/log1p.c
@@ -0,0 +1,475 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange_overflow(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = SING;
+ exc.name = (char *)"log1p";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = -infinity_with_flags(AMD_F_DIVBYZERO);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"log1p";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("log1p: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "log1p"
+
+double FN_PROTOTYPE(log1p)(double x)
+{
+
+ int xexp;
+ double r, r1, r2, correction, f, f1, f2, q, u, v, z1, z2, poly, m2;
+ int index;
+ unsigned long long ux, ax;
+
+ /*
+ Computes natural log(1+x). Algorithm based on:
+ Ping-Tak Peter Tang
+ "Table-driven implementation of the logarithm function in IEEE
+ floating-point arithmetic"
+ ACM Transactions on Mathematical Software (TOMS)
+ Volume 16, Issue 4 (December 1990)
+ Note that we use a lookup table of size 64 rather than 128,
+ and compensate by having extra terms in the minimax polynomial
+ for the kernel approximation.
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+ and ln_tail_table contains a further 53 bits precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+ static const double
+ /* Approximating polynomial coefficients for x near 0.0 */
+ ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */
+ ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */
+ ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */
+ ca_4 = 4.34887777707614552256e-04, /* 0x3f3c8034c85dfff0 */
+
+ /* Approximating polynomial coefficients for other x */
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */
+ cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */
+
+ /* The values exp(-1/16)-1 and exp(1/16)-1 */
+ static const double
+ log1p_thresh1 = -6.05869371865242201114e-02, /* 0xbfaf0540438fd5c4 */
+ log1p_thresh2 = 6.44944589178594318568e-02; /* 0x3fb082b577d34ed8 */
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_DP64)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ return x;
+ }
+ }
+ else if (ux >= 0xbff0000000000000)
+ {
+ /* x <= -1.0 */
+ if (ux > 0xbff0000000000000)
+ {
+ /* x is less than -1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ {
+ /* x is exactly -1.0. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0);
+#else
+ return retval_errno_erange_overflow(x);
+#endif
+ }
+ }
+ else if (ax < 0x3ca0000000000000)
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ /* abs(x) is less than epsilon. Return x with inexact. */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+
+
+ if (x < log1p_thresh1 || x > log1p_thresh2)
+ {
+ /* x is outside the range [exp(-1/16)-1, exp(1/16)-1] */
+ /*
+ First, we decompose the argument x to the form
+ 1 + x = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+ in U, where U = 2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(1+x) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
+
+ f = 1.0 + x;
+ GET_BITS_DP64(f, ux);
+
+ /* Store the exponent of x in xexp and put
+ f into the range [1.0,2.0) */
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ PUT_BITS_DP64((ux & MANTBITS_DP64) | ONEEXPBITS_DP64, f);
+
+ /* Now (1+x) = 2**(xexp) * f, 1 <= f < 2. */
+
+ /* Set index to be the nearest integer to 64*f */
+ /* 64 <= index <= 128 */
+ /*
+ r = 64.0 * f;
+ index = (int)(r + 0.5);
+ */
+ /* This code instead of the above can save several cycles.
+ It only works because 64 <= r < 128, so
+ the nearest integer is always contained in exactly
+ 7 bits, and the right shift is always the same. */
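+ /* In detail: the first term extracts bits 52..46 of f (the implicit
+ bit plus the top six mantissa bits), i.e. floor(64*f); the second
+ term adds mantissa bit 45, the "half" bit of 64*f, so the sum is
+ 64*f rounded to the nearest integer. */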
+ index = (int)((((ux & 0x000fc00000000000) | 0x0010000000000000) >> 46)
+ + ((ux & 0x0000200000000000) >> 45));
+
+ f1 = index * 0.015625; /* 0.015625 = 1/64 */
+ index -= 64;
+
+ /* Now take great care to compute f2 such that f1 + f2 = f */
+ if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8)
+ {
+ f2 = f - f1;
+ }
+ else
+ {
+ /* Create the number m2 = 2.0^(-xexp) */
+ ux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(ux,m2);
+ if (xexp <= MANTLENGTH_DP64 - 1)
+ {
+ f2 = (m2 - f1) + m2*x;
+ }
+ else
+ {
+ f2 = (m2*x - f1) + m2;
+ }
+ }
+
+ /* At this point, x = 2**xexp * ( f1 + f2 ) where
+ f1 = j/64, j = 1, 2, ..., 64 and |f2| <= 1/128. */
+
+ z1 = ln_lead_table[index];
+ q = ln_tail_table[index];
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * (cb_2 + v * cb_3)));
+ z2 = q + (u + u * poly);
+
+ /* Now z1,z2 is an extra-precise approximation of log(f). */
+
+ /* Add xexp * log(2) to z1,z2 to get the result log(1+x).
+ The computed r1 is not subject to rounding error because
+ xexp has at most 10 significant bits, log(2) has 24 significant
+ bits, and z1 has up to 24 bits; and the exponents of z1
+ and z2 differ by at most 6. */
+ r1 = (xexp * log2_lead + z1);
+ r2 = (xexp * log2_tail + z2);
+ /* Natural log(1+x) */
+ return r1 + r2;
+ }
+ else
+ {
+ /* Arguments close to 0.0 are handled separately to maintain
+ accuracy.
+
+ The approximation in this region exploits the identity
+ log( 1 + r ) = log( 1 + u/2 ) - log( 1 - u/2 ), where
+ u = 2r / (2+r).
+ Note that the right hand side has an odd Taylor series expansion
+ which converges much faster than the Taylor series expansion of
+ log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by
+ u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1).
+
+ One subtlety is that since u cannot be calculated from
+ r exactly, the rounding error in the first u should be
+ avoided if possible. To accomplish this, we observe that
+ u = r - r*r/(2+r).
+ Since x (=r) is the input argument, and thus presumed exact,
+ the formula above approximates u accurately because
+ u = r - correction,
+ and the magnitude of "correction" (of the order of r*r)
+ is small.
+ With these observations, we will approximate log( 1 + r ) by
+ r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ).
+
+ We approximate log(1+r) by an odd polynomial in u, where
+ u = 2r/(2+r) = r - r*r/(2+r).
+ */
+ r = x;
+ u = r / (2.0 + r);
+ correction = r * u;
+ u = u + u;
+ v = u * u;
+ r1 = r;
+ r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ return r1 + r2;
+ }
+}
+
+weak_alias (__log1p, log1p)
diff --git a/src/log1pf.c b/src/log1pf.c
new file mode 100644
index 0000000..375a846
--- /dev/null
+++ b/src/log1pf.c
@@ -0,0 +1,416 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NANF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange_overflow(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = SING;
+ exc.name = (char *)"log1pf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = -infinityf_with_flags(AMD_F_DIVBYZERO);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"log1pf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("log1pf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "log1pf"
+
+float FN_PROTOTYPE(log1pf)(float x)
+{
+
+ int xexp;
+ double dx, r, f, f1, f2, q, u, v, z1, z2, poly, m2;
+ int index;
+ unsigned int ux, ax;
+ unsigned long long lux;
+
+ /*
+ Computes natural log(1+x) for float arguments. Algorithm is
+ basically a promotion of the arguments to double followed
+ by an inlined version of the double algorithm, simplified
+ for efficiency (see log1p_amd.c). Simplifications include:
+ * Special algorithm for arguments near 0.0 not required
+ * Scaling of denormalised arguments not required
+ * Shorter core series approximations used
+ Note that we use a lookup table of size 64 rather than 128,
+ and compensate by having extra terms in the minimax polynomial
+ for the kernel approximation.
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+ and ln_tail_table contains a further 53 bits precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ static const double
+ log2 = 6.931471805599453e-01, /* 0x3fe62e42fefa39ef */
+
+ /* Approximating polynomial coefficients */
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02; /* 0x3f89999999865ede */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_SP32)
+ {
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ return x;
+ }
+ }
+ else if (ux >= 0xbf800000)
+ {
+ /* x <= -1.0 */
+ if (ux > 0xbf800000)
+ {
+ /* x is less than -1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ {
+ /* x is exactly -1.0. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0F);
+#else
+ return retval_errno_erange_overflow(x);
+#endif
+ }
+ }
+ else if (ax < 0x33800000)
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ /* abs(x) is less than float epsilon. Return x with inexact. */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+
+ dx = x;
+ /*
+ First, we decompose the argument dx to the form
+ 1 + dx = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+      in U, where U  =  2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(dx) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
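+  /* Illustrative example (not part of the original notes): for dx = 0.3,
+     1 + dx = 1.3 = 2**0 * (1.296875 + 0.003125), so M = 0, F1 = 83/64 and
+     F2 = 0.003125, which indeed satisfies |F2| <= 1/128. */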
+
+ f = 1.0 + dx;
+ GET_BITS_DP64(f, lux);
+
+ /* Store the exponent of f = 1 + dx in xexp and put
+ f into the range [1.0,2.0) */
+ xexp = (int)((lux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ PUT_BITS_DP64((lux & MANTBITS_DP64) | ONEEXPBITS_DP64, f);
+
+ /* Now (1+dx) = 2**(xexp) * f, 1 <= f < 2. */
+
+ /* Set index to be the nearest integer to 64*f */
+ /* 64 <= index <= 128 */
+ /*
+ r = 64.0 * f;
+ index = (int)(r + 0.5);
+ */
+  /* This code, used instead of the lines above, saves several cycles.
+ It only works because 64 <= r < 128, so
+ the nearest integer is always contained in exactly
+ 7 bits, and the right shift is always the same. */
+ index = (int)((((lux & 0x000fc00000000000) | 0x0010000000000000) >> 46)
+ + ((lux & 0x0000200000000000) >> 45));
+
+ f1 = index * 0.015625; /* 0.015625 = 1/64 */
+ index -= 64;
+
+ /* Now take great care to compute f2 such that f1 + f2 = f */
+ if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8)
+ {
+ f2 = f - f1;
+ }
+ else
+ {
+ /* Create the number m2 = 2.0^(-xexp) */
+ lux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(lux,m2);
+ if (xexp <= MANTLENGTH_DP64 - 1)
+ {
+ f2 = (m2 - f1) + m2*dx;
+ }
+ else
+ {
+ f2 = (m2*dx - f1) + m2;
+ }
+ }
+
+  /* At this point, 1 + dx = 2**xexp * ( f1 + f2 ) where
+     f1 = 1 + j/64, j = 0, 1, ..., 64 and |f2| <= 1/128. */
+
+ z1 = ln_lead_table[index];
+ q = ln_tail_table[index];
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * cb_2));
+ z2 = q + (u + u * poly);
+
+ /* Now z1,z2 is an extra-precise approximation of log(f). */
+
+ /* Add xexp * log(2) to z1,z2 to get the result log(1+x). */
+ r = xexp * log2 + z1 + z2;
+ /* Natural log(1+x) */
+ return (float)r;
+}
+
+weak_alias (__log1pf, log1pf)
diff --git a/src/log_special.c b/src/log_special.c
new file mode 100644
index 0000000..53a92b8
--- /dev/null
+++ b/src/log_special.c
@@ -0,0 +1,141 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// y = log10f(x)
+// y = log10(x)
+// y = logf(x)
+// y = log(x)
+
+// these codes and the ones in the related .S or .asm files have to match
+#define LOG_X_ZERO 1
+#define LOG_X_NEG 2
+#define LOG_X_NAN 3
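+
+// Illustrative assumption: the assembly kernels are expected to branch into
+// these handlers with the result value they have already computed, e.g.
+// _logf_special(x, y, LOG_X_NEG) for a negative input; the actual call sites
+// live in the corresponding .S/.asm sources.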
+
+static float _logf_special_common(float x, float y, U32 code, const char *name)
+{
+ switch(code)
+ {
+ case LOG_X_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_errorf(SING, ERANGE, name, x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case LOG_X_NEG:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case LOG_X_NAN:
+ {
+#ifdef WIN64
+ // y is assumed to be qnan, only check x for snan
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, is_x_snan, 0.0f, 0, y, 0);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+ }
+
+ return y;
+}
+
+float _logf_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "logf");
+}
+
+float _log10f_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "log10f");
+}
+
+float _log2f_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "log2f");
+}
+
+static double _log_special_common(double x, double y, U32 code,
+ const char *name)
+{
+ switch(code)
+ {
+ case LOG_X_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_error(SING, ERANGE, name, x, 0.0, y);
+ }
+ break;
+
+ case LOG_X_NEG:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y);
+ }
+ break;
+
+ case LOG_X_NAN:
+ {
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+ }
+
+ return y;
+}
+
+double _log_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log");
+}
+
+double _log10_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log10");
+}
+
+double _log2_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log2");
+}
+
+#endif /* __x86_64__ */
diff --git a/src/logb.c b/src/logb.c
new file mode 100644
index 0000000..7c75ef1
--- /dev/null
+++ b/src/logb.c
@@ -0,0 +1,102 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+#ifdef WINDOWS
+double FN_PROTOTYPE(logb)(double x)
+#else
+double FN_PROTOTYPE(logb)(double x)
+#endif
+{
+
+ unsigned long long ux;
+ long long u;
+ GET_BITS_DP64(x, ux);
+ u = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ if ((ux & ~SIGNBIT_DP64) == 0)
+ /* x is +/-zero. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_error("logb", NINFBITPATT_DP64, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0);
+#else
+ return -infinity_with_flags(AMD_F_DIVBYZERO);
+#endif
+ else if (EMIN_DP64 <= u && u <= EMAX_DP64)
+ /* x is a normal number */
+ return (double)u;
+ else if (u > EMAX_DP64)
+ {
+ /* x is infinity or NaN */
+ if ((ux & MANTBITS_DP64) == 0)
+#ifdef WINDOWS
+ /* x is +/-infinity. For VC++, return infinity of same sign. */
+ return x;
+#else
+ /* x is +/-infinity. Return +infinity with no flags. */
+ return infinity_with_flags(0);
+#endif
+ else
+ /* x is NaN, result is NaN */
+#ifdef WINDOWS
+ return handle_error("logb", ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is denormalized. */
+#ifdef FOLLOW_IEEE754_LOGB
+ /* Return the value of the minimum exponent to ensure that
+ the relationship between logb and scalb, defined in
+ IEEE 754, holds. */
+ return EMIN_DP64;
+#else
+ /* Follow the rule set by IEEE 854 for logb */
+ ux &= MANTBITS_DP64;
+ u = EMIN_DP64;
+ while (ux < IMPBIT_DP64)
+ {
+ ux <<= 1;
+ u--;
+ }
+ return (double)u;
+#endif
+ }
+
+}
+
+weak_alias (__logb, logb)
diff --git a/src/logbf.c b/src/logbf.c
new file mode 100644
index 0000000..d64e531
--- /dev/null
+++ b/src/logbf.c
@@ -0,0 +1,100 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+#ifdef WINDOWS
+float FN_PROTOTYPE(logbf)(float x)
+#else
+float FN_PROTOTYPE(logbf)(float x)
+#endif
+{
+ unsigned int ux;
+ int u;
+ GET_BITS_SP32(x, ux);
+ u = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ if ((ux & ~SIGNBIT_SP32) == 0)
+ /* x is +/-zero. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_errorf("logbf", NINFBITPATT_SP32, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0F);
+#else
+ return -infinityf_with_flags(AMD_F_DIVBYZERO);
+#endif
+ else if (EMIN_SP32 <= u && u <= EMAX_SP32)
+ /* x is a normal number */
+ return (float)u;
+ else if (u > EMAX_SP32)
+ {
+ /* x is infinity or NaN */
+ if ((ux & MANTBITS_SP32) == 0)
+#ifdef WINDOWS
+ /* x is +/-infinity. For VC++, return infinity of same sign. */
+ return x;
+#else
+ /* x is +/-infinity. Return +infinity with no flags. */
+ return infinityf_with_flags(0);
+#endif
+ else
+ /* x is NaN, result is NaN */
+#ifdef WINDOWS
+ return handle_errorf("logbf", ux|0x00400000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is denormalized. */
+#ifdef FOLLOW_IEEE754_LOGB
+ /* Return the value of the minimum exponent to ensure that
+ the relationship between logb and scalb, defined in
+ IEEE 754, holds. */
+ return EMIN_SP32;
+#else
+ /* Follow the rule set by IEEE 854 for logb */
+ ux &= MANTBITS_SP32;
+ u = EMIN_SP32;
+ while (ux < IMPBIT_SP32)
+ {
+ ux <<= 1;
+ u--;
+ }
+ return (float)u;
+#endif
+ }
+}
+
+weak_alias (__logbf, logbf)
diff --git a/src/lrint.c b/src/lrint.c
new file mode 100644
index 0000000..e3c0e41
--- /dev/null
+++ b/src/lrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrint)(double x)
+{
+
+ UT64 checkbits,val_2p52;
+ checkbits.f64=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+        /* The number cannot be rounded here: it exceeds the representable
+           range, or it is NaN or infinity. Raise an exception. */
+ __amd_handle_error(DOMAIN, EDOM, "lrint", x,0.0 ,(double)x);
+
+
+ return (long int) x;
+ }
+
+ val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+    /* Add and subtract 2^52 to round the number according to the current rounding mode */
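+    /* Adding 2^52 (with the sign of x) pushes the fractional bits out of the
+       53-bit significand, so the FPU rounds to an integer in the current
+       rounding mode; subtracting it back leaves that rounded integer. */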
+
+ return (long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
diff --git a/src/lrintf.c b/src/lrintf.c
new file mode 100644
index 0000000..abcd37b
--- /dev/null
+++ b/src/lrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrintf)(float x)
+{
+
+ UT32 checkbits,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+        /* The number cannot be rounded here: it exceeds the representable
+           range, or it is NaN or infinity. Raise an exception. */
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lrintf", x, is_x_snan, 0.0F , 0,(float)x, 0);
+ }
+
+ return (long int) x;
+ }
+
+
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+    /* Add and subtract 2^23 to round the number according to the current rounding mode */
+
+ return (long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/lround.c b/src/lround.c
new file mode 100644
index 0000000..dfe411d
--- /dev/null
+++ b/src/lround.c
@@ -0,0 +1,135 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+long int FN_PROTOTYPE(lround)(double d)
+{
+ UT64 u64d;
+ UT64 u64Temp,u64result;
+ int intexp, shift;
+ U64 sign;
+ long int result;
+
+ u64d.f64 = u64Temp.f64 = d;
+
+ if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000)
+ {
+        /* The input is NaN or infinity: raise a domain error. */
+ #ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_SP32);
+ return (long int )SIGNBIT_SP32;
+ #else
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_DP64);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ #endif
+
+ }
+
+ u64Temp.u32[1] &= 0x7FFFFFFF;
+ intexp = (u64d.u32[1] & 0x7FF00000) >> 20;
+ sign = u64d.u64 & 0x8000000000000000;
+ intexp -= 0x3FF;
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+#ifdef WIN64
+	/* 1.0 x 2^31 is already too large for a 32-bit long */
+ if (intexp >= 31)
+ {
+        /* The value does not fit in a long; return LONG_MIN */
+ result = 0x80000000; /*Return LONG MIN*/
+
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+
+#else
+    /* 1.0 x 2^63 is already too large for a 64-bit long */
+ if (intexp >= 63)
+ {
+        /* The value does not fit in a long; return LONG_MIN */
+ result = 0x8000000000000000; /*Return LONG MIN*/
+
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+#endif
+
+ u64result.f64 = u64Temp.f64;
+ /* >= 2^52 is already an exact integer */
+    if (intexp < 52)
+ {
+ /* add 0.5, extraction below will truncate */
+ u64result.f64 = u64Temp.f64 + 0.5;
+ }
+
+ intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF;
+
+ u64result.u32[1] &= 0xfffff;
+    u64result.u32[1] |= 0x00100000; /* Set the implicit integer bit (bit 52) */
+ shift = intexp - 52;
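+    /* u64result now holds the significand with the implicit bit at position
+       52, representing mant * 2^(intexp - 52); shifting by |shift| in the
+       appropriate direction therefore yields the integer result. */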
+
+#ifdef WIN64
+ /*The shift value will always be negative.*/
+ u64result.u64 = u64result.u64 >> (-shift);
+    /* After the shift the integer result fits in the low 32-bit word */
+ result = u64result.u32[0];
+#else
+ if(shift < 0)
+ u64result.u64 = u64result.u64 >> (-shift);
+ if(shift > 0)
+ u64result.u64 = u64result.u64 << (shift);
+
+ result = u64result.u64;
+#endif
+
+
+
+ if (sign)
+ result = -result;
+
+ return result;
+}
+
diff --git a/src/lroundf.c b/src/lroundf.c
new file mode 100644
index 0000000..799e960
--- /dev/null
+++ b/src/lroundf.c
@@ -0,0 +1,147 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+long int FN_PROTOTYPE(lroundf)(float f)
+{
+ UT32 u32d;
+ UT32 u32Temp,u32result;
+ int intexp, shift;
+ U32 sign;
+ long int result;
+
+ u32d.f32 = u32Temp.f32 = f;
+ if ((u32d.u32 & 0X7F800000) == 0x7F800000)
+ {
+        /* The input is NaN or infinity: raise a domain error. */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ #ifdef WIN64
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_SP32, 0);
+ return (long int)SIGNBIT_SP32;
+ #else
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_DP64, 0);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ #endif
+ }
+
+ }
+
+ u32Temp.u32 &= 0x7FFFFFFF;
+ intexp = (u32d.u32 & 0x7F800000) >> 23;
+ sign = u32d.u32 & 0x80000000;
+ intexp -= 0x7F;
+
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+
+#ifdef WIN64
+ /* 1.0 x 2^31 is already too large */
+ if (intexp >= 31)
+ {
+ result = 0x80000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+
+#else
+    /* 1.0 x 2^63 is already too large for a 64-bit long */
+ if (intexp >= 63)
+ {
+ result = 0x8000000000000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+ #endif
+
+ u32result.f32 = u32Temp.f32;
+
+ /* >= 2^23 is already an exact integer */
+ if (intexp < 23)
+ {
+ /* add 0.5, extraction below will truncate */
+ u32result.f32 = u32Temp.f32 + 0.5F;
+ }
+ intexp = (u32result.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7f;
+ u32result.u32 &= 0x7fffff;
+ u32result.u32 |= 0x00800000;
+
+ result = u32result.u32;
+
+ #ifdef WIN64
+ shift = intexp - 23;
+ #else
+
+	/* Since float is only 32 bits wide, for higher accuracy shift the result
+	 * up by 32 bits here; the shift below then moves it back down by an extra
+	 * 32 bits in the reverse direction, based on intexp (55 = 23 + 32). */
+ result = result << 32;
+ shift = intexp - 55; /*55= 23 +32*/
+ #endif
+
+
+ if(shift < 0)
+ result = result >> (-shift);
+ if(shift > 0)
+ result = result << (shift);
+
+ if (sign)
+ result = -result;
+ return result;
+
+}
+
+
+
diff --git a/src/modf.c b/src/modf.c
new file mode 100644
index 0000000..836db46
--- /dev/null
+++ b/src/modf.c
@@ -0,0 +1,80 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+double FN_PROTOTYPE(modf)(double x, double *iptr)
+{
+ /* modf splits the argument x into integer and fraction parts,
+ each with the same sign as x. */
+
+
+ long long xexp;
+ unsigned long long ux, ax, mask;
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+
+ if (ax >= 0x4340000000000000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ {
+ /* x is NaN */
+ *iptr = x;
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ }
+ else
+ {
+ /* x is infinity or large. Return zero with the sign of x */
+ *iptr = x;
+ PUT_BITS_DP64(ux & SIGNBIT_DP64, x);
+ return x;
+ }
+ }
+ else if (ax < 0x3ff0000000000000)
+ {
+ /* abs(x) < 1.0. Set iptr to zero with the sign of x
+ and return x. */
+ PUT_BITS_DP64(ux & SIGNBIT_DP64, *iptr);
+ return x;
+ }
+ else
+ {
+ double r;
+ unsigned long long ur;
+ xexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of x that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - xexp)) - 1;
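+      /* The low (52 - xexp) mantissa bits of x hold the fractional part, so
+         clearing them leaves the integral part in *iptr. */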
+ PUT_BITS_DP64(ux & ~mask, *iptr);
+ r = x - *iptr;
+ GET_BITS_DP64(r, ur);
+ PUT_BITS_DP64(((ux & SIGNBIT_DP64)|ur), r);
+ return r;
+ }
+
+}
+
+weak_alias (__modf, modf)
diff --git a/src/modff.c b/src/modff.c
new file mode 100644
index 0000000..7e5eae7
--- /dev/null
+++ b/src/modff.c
@@ -0,0 +1,74 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+float FN_PROTOTYPE(modff)(float x, float *iptr)
+{
+ /* modff splits the argument x into integer and fraction parts,
+ each with the same sign as x. */
+
+ unsigned int ux, mask;
+ int xexp;
+
+ GET_BITS_SP32(x, ux);
+ xexp = ((ux & (~SIGNBIT_SP32)) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ if (xexp < 0)
+ {
+ /* abs(x) < 1.0. Set iptr to zero with the sign of x
+ and return x. */
+ PUT_BITS_SP32(ux & SIGNBIT_SP32, *iptr);
+ return x;
+ }
+ else if (xexp < EXPSHIFTBITS_SP32)
+ {
+ float r;
+ unsigned int ur;
+ /* x lies between 1.0 and 2**(24) */
+ /* Mask out the bits of x that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - xexp)) - 1;
+ PUT_BITS_SP32(ux & ~mask, *iptr);
+ r = x - *iptr;
+ GET_BITS_SP32(r, ur);
+ PUT_BITS_SP32(((ux & SIGNBIT_SP32)|ur), r);
+ return r;
+ }
+ else if ((ux & (~SIGNBIT_SP32)) > 0x7f800000)
+ {
+ /* x is NaN */
+ *iptr = x;
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ }
+ else
+ {
+ /* x is infinity or large. Set iptr to x and return zero
+ with the sign of x. */
+ *iptr = x;
+ PUT_BITS_SP32(ux & SIGNBIT_SP32, x);
+ return x;
+ }
+}
+
+weak_alias (__modff, modff)
diff --git a/src/nan.c b/src/nan.c
new file mode 100644
index 0000000..fbfc52c
--- /dev/null
+++ b/src/nan.c
@@ -0,0 +1,114 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include <stdio.h>
+
+double FN_PROTOTYPE(nan)(const char *tagp)
+{
+
+
+ /* Check for input range */
+ UT64 checkbits;
+ U64 val=0;
+ S64 num;
+ checkbits.u64 =QNANBITPATT_DP64;
+ if(tagp == NULL)
+ {
+ return checkbits.f64;
+ }
+
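+    /* Parse tagp as a decimal, octal (leading '0') or hexadecimal (leading
+       "0x"/"0X") number; the parsed value becomes the NaN payload below, and
+       any invalid digit falls back to the default quiet NaN. */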
+ switch(*tagp)
+ {
+        case '0': /* leading '0': octal, or hexadecimal if followed by 'x'/'X' */
+ tagp++;
+ if( *tagp == 'x' || *tagp == 'X')
+ {
+ /* base 16 */
+ tagp++;
+ while(*tagp != '\0')
+ {
+
+ if(*tagp >= 'A' && *tagp <= 'F' )
+ {
+ num = *tagp - 'A' + 10;
+ }
+ else
+ if(*tagp >= 'a' && *tagp <= 'f' )
+ {
+ num = *tagp - 'a' + 10;
+ }
+ else
+ {
+ num = *tagp - '0';
+ }
+
+ if( (num < 0 || num > 15))
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = (val << 4) | num;
+ tagp++;
+ }
+ }
+ else
+ {
+ /* base 8 */
+ while(*tagp != '\0')
+ {
+ num = *tagp - '0';
+ if( num < 0 || num > 7)
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = (val << 3) | num;
+ tagp++;
+ }
+ }
+ break;
+ default:
+ while(*tagp != '\0')
+ {
+ val = val*10;
+ num = *tagp - '0';
+ if( num < 0 || num > 9)
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = val + num;
+ tagp++;
+ }
+
+ }
+
+ if((val & ~NINFBITPATT_DP64) == 0)
+ val = QNANBITPATT_DP64;
+
+ checkbits.u64 = (val | QNANBITPATT_DP64) & ~SIGNBIT_DP64;
+ return checkbits.f64 ;
+}
+
diff --git a/src/nanf.c b/src/nanf.c
new file mode 100644
index 0000000..8d712f2
--- /dev/null
+++ b/src/nanf.c
@@ -0,0 +1,120 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include <stdio.h>
+
+
+float FN_PROTOTYPE(nanf)(const char *tagp)
+{
+
+
+ /* Check for input range */
+ UT32 checkbits;
+ U32 val=0;
+ S32 num;
+ checkbits.u32 =QNANBITPATT_SP32;
+ if(tagp == NULL)
+ return checkbits.f32 ;
+
+
+ switch(*tagp)
+ {
+        case '0': /* leading '0': octal, or hexadecimal if followed by 'x'/'X' */
+ tagp++;
+ if( *tagp == 'x' || *tagp == 'X')
+ {
+ /* base 16 */
+ tagp++;
+ while(*tagp != '\0')
+ {
+
+ if(*tagp >= 'A' && *tagp <= 'F' )
+ {
+ num = *tagp - 'A' + 10;
+ }
+ else
+ if(*tagp >= 'a' && *tagp <= 'f' )
+ {
+ num = *tagp - 'a' + 10;
+ }
+ else
+ {
+ num = *tagp - '0';
+ }
+
+ if( (num < 0 || num > 15))
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = (val << 4) | num;
+ tagp++;
+ }
+ }
+ else
+ {
+ /* base 8 */
+ while(*tagp != '\0')
+ {
+ num = *tagp - '0';
+ if( num < 0 || num > 7)
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = (val << 3) | num;
+ tagp++;
+ }
+ }
+ break;
+ default:
+ while(*tagp != '\0')
+ {
+ val = val*10;
+ num = *tagp - '0';
+ if( num < 0 || num > 9)
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = val + num;
+ tagp++;
+ }
+
+ }
+
+
+ if((val & ~INDEFBITPATT_SP32) == 0)
+ val = QNANBITPATT_SP32;
+
+ checkbits.u32 = (val | QNANBITPATT_SP32) & ~SIGNBIT_SP32;
+
+
+ return checkbits.f32 ;
+}
diff --git a/src/nearbyintf.c b/src/nearbyintf.c
new file mode 100644
index 0000000..2b656ef
--- /dev/null
+++ b/src/nearbyintf.c
@@ -0,0 +1,51 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+float FN_PROTOTYPE(nearbyintf)(float x)
+{
+ /* Check for input range */
+ UT32 checkbits,sign,val_2p23;
+ checkbits.f32=x;
+
+    /* Clear the sign bit and check whether the value can be rounded (i.e., check that the exponent is less than 23) */
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+ /* take care of nan or inf */
+ if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32 & 0x80000000;
+ val_2p23.u32 = sign.u32 | 0x4B000000;
+ val_2p23.f32 = (x + val_2p23.f32) - val_2p23.f32;
+    /* This extra step takes care of denormals and the various rounding modes. */
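+    /* ((u << 1) >> 1) clears whatever sign bit the add/subtract produced, and
+       OR-ing the saved sign back in restores the sign of x, so results that
+       collapse to zero keep x's sign. */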
+ val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+ return (val_2p23.f32);
+}
+
diff --git a/src/nextafter.c b/src/nextafter.c
new file mode 100644
index 0000000..62d9b5a
--- /dev/null
+++ b/src/nextafter.c
@@ -0,0 +1,91 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nextafter)(double x, double y)
+{
+
+
+ UT64 checkbits;
+ double dy = y;
+ checkbits.f64=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y , x+x);
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u64 = 1;
+ if( dy > 0.0 )
+ return checkbits.f64;
+ else
+ return -checkbits.f64;
+ }
+
+
+    /* compute the next higher or lower value */
+
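+    /* For positive finite doubles the bit pattern, read as an unsigned
+       integer, grows with the value; for negative ones it grows with the
+       magnitude.  Stepping the pattern by one therefore moves to the adjacent
+       representable value, and the XOR test below picks the direction:
+       increment to move away from zero, decrement to move toward it. */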
+ if(((x>0.0) ^ (dy>x)) == 0)
+ {
+ checkbits.u64++;
+ }
+ else
+ {
+ checkbits.u64--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y , checkbits.f64);
+
+ }
+
+ return checkbits.f64;
+}
diff --git a/src/nextafterf.c b/src/nextafterf.c
new file mode 100644
index 0000000..019187f
--- /dev/null
+++ b/src/nextafterf.c
@@ -0,0 +1,102 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+float FN_PROTOTYPE(nextafterf)(float x, float y)
+{
+
+
+ UT32 checkbits;
+ float dy = y;
+ checkbits.f32=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y , 0,x+x, 0);
+
+ }
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u32 = 1;
+ if( dy > 0.0 )
+ return checkbits.f32;
+ else
+ return -checkbits.f32;
+ }
+
+
+    /* compute the next higher or lower value */
+ if(((x>0.0F) ^ (dy>x)) == 0)
+ {
+ checkbits.u32++;
+ }
+ else
+ {
+ checkbits.u32--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y , 0,checkbits.f32, 0);
+
+ }
+ }
+
+ return checkbits.f32;
+}
diff --git a/src/nexttoward.c b/src/nexttoward.c
new file mode 100644
index 0000000..14b2f62
--- /dev/null
+++ b/src/nexttoward.c
@@ -0,0 +1,93 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nexttoward)(double x, long double y)
+{
+
+
+ UT64 checkbits;
+ long double dy = (long double) y;
+ checkbits.f64=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return (double) dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+
+ __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y ,x+x);
+
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u64 = 1;
+ if( dy > 0.0 )
+ return checkbits.f64;
+ else
+ return -checkbits.f64;
+ }
+
+
+    /* compute the next higher or lower value */
+
+ if(((x>0.0) ^ (dy>x)) == 0)
+ {
+ checkbits.u64++;
+ }
+ else
+ {
+ checkbits.u64--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y ,checkbits.f64);
+
+
+ }
+
+ return checkbits.f64;
+}
diff --git a/src/nexttowardf.c b/src/nexttowardf.c
new file mode 100644
index 0000000..47b42c7
--- /dev/null
+++ b/src/nexttowardf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+float FN_PROTOTYPE(nexttowardf)(float x, long double y)
+{
+
+
+ UT32 checkbits;
+ long double dy = (long double) y;
+ checkbits.f32=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return (float) dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y , 0,x+x, 0);
+
+ }
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u32 = 1;
+ if( dy > 0.0 )
+ return checkbits.f32;
+ else
+ return -checkbits.f32;
+ }
+
+
+	/* compute the next higher or lower value */
+ if(((x>0.0F) ^ (dy>x)) == 0)
+ {
+ checkbits.u32++;
+ }
+ else
+ {
+ checkbits.u32--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y , 0,checkbits.f32, 0);
+ }
+ }
+
+ return checkbits.f32;
+}
diff --git a/src/pow_special.c b/src/pow_special.c
new file mode 100644
index 0000000..cb571d2
--- /dev/null
+++ b/src/pow_special.c
@@ -0,0 +1,168 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// these codes and the ones in the related .S or .asm files have to match
+#define POW_X_ONE_Y_SNAN 1
+#define POW_X_ZERO_Z_INF 2
+#define POW_X_NAN 3
+#define POW_Y_NAN 4
+#define POW_X_NAN_Y_NAN 5
+#define POW_X_NEG_Y_NOTINT 6
+#define POW_Z_ZERO 7
+#define POW_Z_DENORMAL 8
+#define POW_Z_INF 9
+
+float _powf_special(float x, float y, float z, U32 code)
+{
+ switch(code)
+ {
+ case POW_X_ONE_Y_SNAN:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ }
+ break;
+
+ case POW_X_ZERO_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_errorf(SING, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_X_NAN:
+ case POW_Y_NAN:
+ case POW_X_NAN_Y_NAN:
+ {
+#ifdef WIN64
+ unsigned int is_x_snan = 0, is_y_snan = 0, is_z_snan = 0;
+ UT32 xm, ym, zm;
+ xm.f32 = x;
+ ym.f32 = y;
+ zm.f32 = z;
+ if(code == POW_X_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ if(code == POW_Y_NAN) { is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ if(code == POW_X_NAN_Y_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ is_z_snan = ( ((zm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, "powf", x, is_x_snan, y, is_y_snan, z, is_z_snan);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case POW_X_NEG_Y_NOTINT:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_Z_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_errorf(OVERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+ }
+
+ return z;
+}
+
+double _pow_special(double x, double y, double z, U32 code)
+{
+ switch(code)
+ {
+ case POW_X_ONE_Y_SNAN:
+ {
+#ifdef WIN64
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case POW_X_ZERO_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_error(SING, ERANGE, "pow", x, y, z);
+ }
+ break;
+
+ case POW_X_NAN:
+ case POW_Y_NAN:
+ case POW_X_NAN_Y_NAN:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z);
+#endif
+ }
+ break;
+
+ case POW_X_NEG_Y_NOTINT:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z);
+ }
+ break;
+
+ case POW_Z_ZERO:
+ case POW_Z_DENORMAL:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_error(UNDERFLOW, ERANGE, "pow", x, y, z);
+ }
+ break;
+
+ case POW_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_error(OVERFLOW, ERANGE, "pow", x, y, z);
+ }
+ break;
+ }
+
+ return z;
+}
+
+#endif /* __x86_64__ */
diff --git a/src/remainder_piby2.c b/src/remainder_piby2.c
new file mode 100644
index 0000000..3f6676f
--- /dev/null
+++ b/src/remainder_piby2.c
@@ -0,0 +1,331 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#define EXPBITS_DP64 0x7ff0000000000000
+#define EXPSHIFTBITS_DP64 52
+#define EXPBIAS_DP64 1023
+#define MANTBITS_DP64 0x000fffffffffffff
+#define IMPBIT_DP64 0x0010000000000000
+#define SIGNBIT_DP64 0x8000000000000000
+
+
+#define GET_BITS_DP64(x, ux) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.d = (x); \
+ ux = _bitsy.i; \
+ }
+
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
+
+/* Define this to get debugging print statements activated */
+#define DEBUGGING_PRINT
+#undef DEBUGGING_PRINT
+
+
+#ifdef DEBUGGING_PRINT
+#include <stdio.h>
+char *d2b(int d, int bitsper, int point)
+{
+ static char buff[50];
+ int i, j;
+ j = bitsper;
+ if (point >= 0 && point <= bitsper)
+ j++;
+ buff[j] = '\0';
+ for (i = bitsper - 1; i >= 0; i--)
+ {
+ j--;
+ if (d % 2 == 1)
+ buff[j] = '1';
+ else
+ buff[j] = '0';
+ if (i == point)
+ {
+ j--;
+ buff[j] = '.';
+ }
+ d /= 2;
+ }
+ return buff;
+}
+#endif
+
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r, rr.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
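+/* Illustrative example: x = 10.0 is 6*(pi/2) + 0.5752220392..., so the
+   routine delivers region = 6 mod 4 = 2 and r ~= 0.5752220392, with rr
+   holding the low-order tail of that remainder. */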
+void __amd_remainder_piby2(double x, double *r, double *rr, int *region)
+{
+
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+ static const double
+ piby2_lead = 1.57079632679489655800e+00, /* 0x3ff921fb54442d18 */
+ piby2_part1 = 1.57079631090164184570e+00, /* 0x3ff921fb50000000 */
+ piby2_part2 = 1.58932547122958567343e-08, /* 0x3e5110b460000000 */
+ piby2_part3 = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */
+ const int bitsper = 10;
+ unsigned long long res[500];
+ unsigned long long ux, u, carry, mask, mant, highbitsrr;
+ int first, last, i, rexp, xexp, resexp, ltb, determ;
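+  /* 2/pi split into consecutive 10-bit chunks (bitsper = 10), most
+     significant chunk first and padded with leading zero chunks; the window
+     pibits[first..last] below supplies the roughly 180 bits of 2/pi needed
+     for a given x. */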
+ double xx, t;
+ static unsigned long long pibits[] =
+ {
+ 0, 0, 0, 0, 0, 0,
+ 162, 998, 54, 915, 580, 84, 671, 777, 855, 839,
+ 851, 311, 448, 877, 553, 358, 316, 270, 260, 127,
+ 593, 398, 701, 942, 965, 390, 882, 283, 570, 265,
+ 221, 184, 6, 292, 750, 642, 465, 584, 463, 903,
+ 491, 114, 786, 617, 830, 930, 35, 381, 302, 749,
+ 72, 314, 412, 448, 619, 279, 894, 260, 921, 117,
+ 569, 525, 307, 637, 156, 529, 504, 751, 505, 160,
+ 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98,
+ 858, 41, 721, 987, 310, 507, 242, 498, 777, 733,
+ 244, 399, 870, 633, 510, 651, 373, 158, 940, 506,
+ 997, 965, 947, 833, 825, 990, 165, 164, 746, 431,
+ 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798
+ };
+
+ GET_BITS_DP64(x, ux);
+
+#ifdef DEBUGGING_PRINT
+ printf("On entry, x = %25.20e = %s\n", x, double2hex(&x));
+#endif
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = (ux & MANTBITS_DP64) | IMPBIT_DP64;
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ carry = 0;
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 180 is the theoretical maximum number of bits (actually
+ 175 for IEEE double precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 180 / bitsper;
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 10. */
+ res[19] = 0;
+ u = pibits[last] * ux;
+ res[18] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-1] * ux + carry;
+ res[17] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-2] * ux + carry;
+ res[16] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-3] * ux + carry;
+ res[15] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-4] * ux + carry;
+ res[14] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-5] * ux + carry;
+ res[13] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-6] * ux + carry;
+ res[12] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-7] * ux + carry;
+ res[11] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-8] * ux + carry;
+ res[10] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-9] * ux + carry;
+ res[9] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-10] * ux + carry;
+ res[8] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-11] * ux + carry;
+ res[7] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-12] * ux + carry;
+ res[6] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-13] * ux + carry;
+ res[5] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-14] * ux + carry;
+ res[4] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-15] * ux + carry;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-16] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-17] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-18] * ux + carry;
+ res[0] = u & mask;
+
+#ifdef DEBUGGING_PRINT
+ printf("resexp = %d\n", resexp);
+ printf("Significant part of x * 2/pi with binary"
+ " point in correct place:\n");
+ for (i = 0; i <= last - first; i++)
+ {
+ if (i > 0 && i % 5 == 0)
+ printf("\n ");
+ if (i == 1)
+ printf("%s ", d2b((int)res[i], bitsper, resexp));
+ else
+ printf("%s ", d2b((int)res[i], bitsper, -1));
+ }
+ printf("\n");
+#endif
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+#ifdef DEBUGGING_PRINT
+ printf("ltb = %d (last two bits before binary point"
+ " and first bit after)\n", ltb);
+ printf("determ = %d (1 means need to negate because the fractional\n"
+ " part of x * 2/pi is greater than 0.5)\n", determ);
+#endif
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ highbitsrr = ~(res[i + 1]) << (64 - bitsper);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ highbitsrr = res[i + 1] << (64 - bitsper);
+ }
+
+ rexp = 52 + resexp - i * bitsper;
+
+ while (mant >= 0x0020000000000000)
+ {
+ rexp++;
+ highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63);
+ mant >>= 1;
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf("Normalised mantissa = 0x%016lx\n", mant);
+ printf("High bits of rest of mantissa = 0x%016lx\n", highbitsrr);
+ printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp);
+#endif
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, x);
+
+ /* Create the bit pattern for rr */
+ highbitsrr >>= 12; /* Note this is shifted one place too far */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64 - 53) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(u, t);
+ u |= highbitsrr;
+ PUT_BITS_DP64(u, xx);
+
+ /* Subtract the implicit bit we accidentally added */
+ xx -= t;
+ /* Set the correct sign, and double to account for the
+ "one place too far" shift */
+ if (determ)
+ xx *= -2.0;
+ else
+ xx *= 2.0;
+
+#ifdef DEBUGGING_PRINT
+ printf("(lead part of x*2/pi) = %25.20e = %s\n", x, double2hex(&x));
+ printf("(tail part of x*2/pi) = %25.20e = %s\n", xx, double2hex(&xx));
+#endif
+
+ /* (x,xx) is an extra-precise version of the fractional part of
+ x * 2 / pi. Multiply (x,xx) by pi/2 in extra precision
+ to get the reduced argument (r,rr). */
+ {
+ double hx, tx, c, cc;
+ /* Split x into hx (head) and tx (tail) */
+ GET_BITS_DP64(x, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, hx);
+ tx = x - hx;
+
+ c = piby2_lead * x;
+ cc = ((((piby2_part1 * hx - c) + piby2_part1 * tx) +
+ piby2_part2 * hx) + piby2_part2 * tx) +
+ (piby2_lead * xx + piby2_part3 * x);
+ *r = c + cc;
+ *rr = (c - *r) + cc;
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf(" (r,rr) = lead and tail parts of frac(x*2/pi) * pi/2:\n");
+ printf(" r = %25.20e = %s\n", *r, double2hex(r));
+ printf("rr = %25.20e = %s\n", *rr, double2hex(rr));
+ printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n",
+ *region);
+#endif
+ return;
+}
diff --git a/src/remainder_piby2d2f.c b/src/remainder_piby2d2f.c
new file mode 100644
index 0000000..59ed44a
--- /dev/null
+++ b/src/remainder_piby2d2f.c
@@ -0,0 +1,217 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#define EXPBITS_DP64 0x7ff0000000000000
+#define EXPSHIFTBITS_DP64 52
+#define EXPBIAS_DP64 1023
+#define MANTBITS_DP64 0x000fffffffffffff
+#define IMPBIT_DP64 0x0010000000000000
+#define SIGNBIT_DP64 0x8000000000000000
+
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
+
+/* Derived from static inline void __amd_remainder_piby2f_inline(unsigned long long ux, double *r, int *region)
+   in libm_inlines_amd.h. libm_inlines.h has the pure Windows version, while libm_inlines_amd.h has the mixed one.
+*/
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
+void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region)
+{
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+ unsigned long long u, carry, mask, mant, highbitsrr;
+ double dx;
+ unsigned long long res[500];
+ int first, last, i, rexp, xexp, resexp, ltb, determ;
+ static const double
+ piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ const int bitsper = 10;
+ static unsigned long long pibits[] =
+ {
+ 0, 0, 0, 0, 0, 0,
+ 162, 998, 54, 915, 580, 84, 671, 777, 855, 839,
+ 851, 311, 448, 877, 553, 358, 316, 270, 260, 127,
+ 593, 398, 701, 942, 965, 390, 882, 283, 570, 265,
+ 221, 184, 6, 292, 750, 642, 465, 584, 463, 903,
+ 491, 114, 786, 617, 830, 930, 35, 381, 302, 749,
+ 72, 314, 412, 448, 619, 279, 894, 260, 921, 117,
+ 569, 525, 307, 637, 156, 529, 504, 751, 505, 160,
+ 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98,
+ 858, 41, 721, 987, 310, 507, 242, 498, 777, 733,
+ 244, 399, 870, 633, 510, 651, 373, 158, 940, 506,
+ 997, 965, 947, 833, 825, 990, 165, 164, 746, 431,
+ 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798
+ };
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = (ux & MANTBITS_DP64) | IMPBIT_DP64;
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 180 is the theoretical maximum number of bits (actually
+ 175 for IEEE double precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 180 / bitsper;
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 10. */
+ res[19] = 0;
+ u = pibits[last] * ux;
+ res[18] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-1] * ux + carry;
+ res[17] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-2] * ux + carry;
+ res[16] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-3] * ux + carry;
+ res[15] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-4] * ux + carry;
+ res[14] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-5] * ux + carry;
+ res[13] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-6] * ux + carry;
+ res[12] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-7] * ux + carry;
+ res[11] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-8] * ux + carry;
+ res[10] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-9] * ux + carry;
+ res[9] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-10] * ux + carry;
+ res[8] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-11] * ux + carry;
+ res[7] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-12] * ux + carry;
+ res[6] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-13] * ux + carry;
+ res[5] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-14] * ux + carry;
+ res[4] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-15] * ux + carry;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-16] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-17] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-18] * ux + carry;
+ res[0] = u & mask;
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ highbitsrr = ~(res[i + 1]) << (64 - bitsper);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ highbitsrr = res[i + 1] << (64 - bitsper);
+ }
+
+ rexp = 52 + resexp - i * bitsper;
+
+ while (mant >= 0x0020000000000000)
+ {
+ rexp++;
+ highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63);
+ mant >>= 1;
+ }
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, dx);
+
+    /* dx now holds a double precision version of the fractional part of
+       x * 2 / pi. Multiply dx by pi/2 in double precision
+       to get the reduced argument r. */
+ *r = dx * piby2;
+
+ return;
+}
+
+void __remainder_piby2d2f(unsigned long ux, double *r, int *region)
+{
+ __amd_remainder_piby2d2f((unsigned long long) ux, r, region);
+}
+
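
The unrolled block above is a fixed-length long multiplication in base 2^bitsper: each 10-bit chunk of 2/pi is multiplied by the 53-bit integer mantissa and the carry is propagated into the next, more significant, chunk. A small self-contained sketch of the same technique follows; the digit values and the multiplier are chosen for illustration only.

#include <stdio.h>

int main(void)
{
    const int bitsper = 10;
    const unsigned long long mask = (1ULL << bitsper) - 1;
    /* Four 10-bit digits of a multi-precision number, most significant first
       (values chosen for illustration only). */
    unsigned long long digits[4] = { 162, 998, 54, 915 };
    unsigned long long ux = 0x001921fb54442d18ULL;  /* a 53-bit multiplier */
    unsigned long long res[10] = { 0 }, carry = 0, u;
    int pos = 9;

    /* Multiply chunk by chunk, least significant digit first, carrying into
       the next chunk exactly as the unrolled loop above does. */
    for (int k = 3; k >= 0; --k)
    {
        u = digits[k] * ux + carry;
        res[pos--] = u & mask;
        carry = u >> bitsper;
    }
    while (carry && pos >= 0)   /* flush the remaining carry into higher chunks */
    {
        res[pos--] = carry & mask;
        carry >>= bitsper;
    }

    for (int k = 0; k < 10; ++k)
        printf("res[%d] = %llu\n", k, res[k]);
    return 0;
}
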
diff --git a/src/rint.c b/src/rint.c
new file mode 100644
index 0000000..770685f
--- /dev/null
+++ b/src/rint.c
@@ -0,0 +1,69 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+double FN_PROTOTYPE(rint)(double x)
+{
+
+ UT64 checkbits,val_2p52;
+ UT32 sign;
+ checkbits.f64=x;
+
+    /* Clear the sign bit and check whether the value can be rounded
+       (i.e. whether the exponent is less than 52) */
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+ /* take care of nan or inf */
+ if((checkbits.u32[1] & 0x7ff00000)== 0x7ff00000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32[1] & 0x80000000;
+ val_2p52.u32[1] = sign.u32 | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+ /* Add and sub 2^52 to round the number according to the current rounding direction */
+ val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64;
+
+ /*This extra line is to take care of denormals and various rounding modes*/
+ val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32;
+
+ if(x!=val_2p52.f64)
+ {
+ /* Raise floating-point inexact exception if the result differs in value from the argument */
+ checkbits.u64 = QNANBITPATT_DP64;
+ checkbits.f64 = checkbits.f64 + checkbits.f64; /* raise inexact exception by adding two nan numbers.*/
+ }
+
+
+ return (val_2p52.f64);
+}
+
+
+
+
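
A standalone sketch of the add-and-subtract-2^52 trick that rint() relies on above, assuming round-to-nearest-even and 0 <= x < 2^52; the sign, NaN/Inf and inexact-flag handling of the real routine is not reproduced.

#include <stdio.h>

static double round_via_2p52(double x)
{
    /* 2^52: adding it pushes the fraction bits out of the significand, so the
       hardware rounds x in the current rounding mode; subtracting restores
       the magnitude. volatile keeps the compiler from folding the pair away. */
    volatile double big = 4503599627370496.0;
    volatile double t = x + big;
    return t - big;
}

int main(void)
{
    printf("%.1f\n", round_via_2p52(2.5));   /* 2.0 under round-to-nearest-even */
    printf("%.1f\n", round_via_2p52(3.5));   /* 4.0 */
    printf("%.1f\n", round_via_2p52(2.3));   /* 2.0 */
    return 0;
}
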
diff --git a/src/rintf.c b/src/rintf.c
new file mode 100644
index 0000000..e048c11
--- /dev/null
+++ b/src/rintf.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(rintf)(float x)
+{
+
+ UT32 checkbits,sign,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+        /* The magnitude exceeds 2^23, so the value is already an integer; it could also be NaN or Inf */
+        /* take care of NaN or Inf */
+ if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32 & 0x80000000;
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+ /* Add and sub 2^23 to round the number according to the current rounding direction */
+ val_2p23.f32 = ((x + val_2p23.f32) - val_2p23.f32);
+
+ /*This extra line is to take care of denormals and various rounding modes*/
+ val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+
+ if (val_2p23.f32 != x)
+ {
+ /* Raise floating-point inexact exception if the result differs in value from the argument */
+ checkbits.u32 = 0xFFC00000;
+ checkbits.f32 = checkbits.f32 + checkbits.f32; /* raise inexact exception by adding two nan numbers.*/
+ }
+
+
+ return val_2p23.f32;
+}
+
diff --git a/src/roundf.c b/src/roundf.c
new file mode 100644
index 0000000..596c381
--- /dev/null
+++ b/src/roundf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(roundf)(float f)
+{
+ UT32 u32f, u32Temp;
+ U32 u32sign, u32exp, u32mantissa;
+ int intexp; /*Needs to be signed */
+ u32f.f32 = f;
+ u32sign = u32f.u32 & SIGNBIT_SP32;
+ if ((u32f.u32 & 0X7F800000) == 0x7F800000)
+ {
+ //u32f.f32 = f;
+        /* Return a quiet NaN:
+           quiet the signalling NaN */
+ if(!((u32f.u32 & MANTBITS_SP32) == 0))
+ u32f.u32 |= QNAN_MASK_32;
+ /*else the number is infinity*/
+ //Raise range or domain error
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "roundf", f, is_x_snan, 0.0F , 0,u32f.f32, 0);
+ }
+
+
+ return u32f.f32;
+ }
+ /*Get the exponent of the input*/
+ intexp = (u32f.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7F;
+ /*If exponent is greater than 22 then the number is already
+ rounded*/
+ if (intexp > 22)
+ return f;
+ if (intexp < 0)
+ {
+ u32Temp.f32 = f;
+ u32Temp.u32 &= 0x7FFFFFFF;
+        /* Add a large number (2^23 + 1 = 8388609.0F) to force
+           the fraction bits out of the significand */
+ u32Temp.f32 = (u32Temp.f32 + 8388609.0F);
+        /* Subtract the large number back out */
+ u32Temp.f32 -= 8388609;
+ if (u32sign)
+ u32Temp.u32 |= 0x80000000;
+ return u32Temp.f32;
+ }
+ else
+ {
+ /*if(intexp == -1)
+ u32exp = 0x3F800000; */
+ u32f.u32 &= 0x7FFFFFFF;
+ u32f.f32 += 0.5;
+ u32exp = u32f.u32 & 0x7F800000;
+        /* Right shift then left shift to discard the fraction bits */
+ u32mantissa = (u32f.u32 & MANTBITS_SP32) >> (23 - intexp);
+ u32mantissa = u32mantissa << (23 - intexp);
+ u32Temp.u32 = u32sign | u32exp | u32mantissa;
+ return (u32Temp.f32);
+ }
+}
+
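
A small sketch of the 2^23 + 1 trick used in the |f| < 1 branch of roundf above; the inputs are illustrative and the sign re-attachment of the routine is not reproduced.

#include <stdio.h>

int main(void)
{
    /* 2^23 + 1: adding it forces the fraction bits of a float out of the
       significand, and because the integer part is odd a halfway value rounds
       up to the even neighbour, giving the round-half-away-from-zero result
       that roundf needs for non-negative inputs. */
    volatile float big = 8388609.0F;
    float inputs[3] = { 0.25F, 0.5F, 0.75F };
    for (int i = 0; i < 3; ++i)
    {
        volatile float t = inputs[i] + big;
        printf("round(%.2f) -> %.1f\n", inputs[i], t - big);  /* 0.0, 1.0, 1.0 */
    }
    return 0;
}
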
diff --git a/src/scalbln.c b/src/scalbln.c
new file mode 100644
index 0000000..51499d8
--- /dev/null
+++ b/src/scalbln.c
@@ -0,0 +1,119 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+double FN_PROTOTYPE(scalbln)(double x, long int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+ return val.f64;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
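
A minimal sketch of the exponent-field update that the normal-to-normal path above performs; it assumes both the input and the result are normal doubles, so all of the NaN/Inf, denormal, underflow and overflow branches are omitted, and the helper name scale_normal is illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double scale_normal(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Add n directly to the biased exponent field (bits 52..62). */
    int exponent = (int)((bits >> 52) & 0x7ff) + n;
    bits = (bits & 0x800fffffffffffffULL) | ((uint64_t)exponent << 52);
    memcpy(&x, &bits, sizeof bits);
    return x;
}

int main(void)
{
    printf("%g\n", scale_normal(3.0, 4));    /* 48 */
    printf("%g\n", scale_normal(1.5, -2));   /* 0.375 */
    return 0;
}
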
diff --git a/src/scalblnf.c b/src/scalblnf.c
new file mode 100644
index 0000000..cc627bb
--- /dev/null
+++ b/src/scalblnf.c
@@ -0,0 +1,133 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalblnf)(float x, long int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/scalbn.c b/src/scalbn.c
new file mode 100644
index 0000000..facb718
--- /dev/null
+++ b/src/scalbn.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+
+double FN_PROTOTYPE(scalbn)(double x, int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
diff --git a/src/scalbnf.c b/src/scalbnf.c
new file mode 100644
index 0000000..1477fe1
--- /dev/null
+++ b/src/scalbnf.c
@@ -0,0 +1,138 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalbnf)(float x, int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+
+ return val.f32;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/sincos_special.c b/src/sincos_special.c
new file mode 100644
index 0000000..c349d10
--- /dev/null
+++ b/src/sincos_special.c
@@ -0,0 +1,151 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+double _sin_cos_special(double x, const char *name)
+{
+ UT64 xu;
+ unsigned int is_snan;
+
+ xu.f64 = x;
+
+ if((xu.u64 & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ // x is Inf or NaN
+ if((xu.u64 & MANTBITS_DP64) == 0x0)
+ {
+ // x is Inf
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ xu.u64 = INDEFBITPATT_DP64;
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64);
+#else
+ xu.u64 = QNANBITPATT_DP64;
+ name = *(&name); // dummy statement to avoid warning
+#endif
+ }
+ else {
+ // x is NaN
+ is_snan = (((xu.u64 & QNAN_MASK_64) == QNAN_MASK_64) ? 0 : 1);
+ if(is_snan){
+ xu.u64 |= QNAN_MASK_64;
+#ifdef WIN64
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64);
+#endif
+ }
+
+ }
+
+ return xu.f64;
+}
+
+float _sinf_cosf_special(float x, const char *name)
+{
+ UT32 xu;
+ unsigned int is_snan;
+
+ xu.f32 = x;
+
+ if((xu.u32 & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ // x is Inf or NaN
+ if((xu.u32 & MANTBITS_SP32) == 0x0)
+ {
+ // x is Inf
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ xu.u32 = INDEFBITPATT_SP32;
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, xu.f32, 0);
+#else
+ xu.u32 = QNANBITPATT_SP32;
+ name = *(&name); // dummy statement to avoid warning
+#endif
+ }
+ else {
+ // x is NaN
+ is_snan = (((xu.u32 & QNAN_MASK_32) == QNAN_MASK_32) ? 0 : 1);
+ if(is_snan) {
+ xu.u32 |= QNAN_MASK_32;
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ }
+#ifdef WIN64
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, is_snan, 0.0f, 0, xu.f32, 0);
+#endif
+ }
+
+ }
+
+ return xu.f32;
+}
+
+float _sinf_special(float x)
+{
+ return _sinf_cosf_special(x, "sinf");
+}
+
+double _sin_special(double x)
+{
+ return _sin_cos_special(x, "sin");
+}
+
+float _cosf_special(float x)
+{
+ return _sinf_cosf_special(x, "cosf");
+}
+
+double _cos_special(double x)
+{
+ return _sin_cos_special(x, "cos");
+}
+
+void _sincosf_special(float x, float *sy, float *cy)
+{
+ float xu = _sinf_cosf_special(x, "sincosf");
+
+ *sy = xu;
+ *cy = xu;
+
+ return;
+}
+
+void _sincos_special(double x, double *sy, double *cy)
+{
+ double xu = _sin_cos_special(x, "sincos");
+
+ *sy = xu;
+ *cy = xu;
+
+ return;
+}
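
A small check of the behaviour these special-case handlers implement for infinite arguments: sin(+Inf) is a domain error, so the result must be a NaN and the invalid exception must be raised. The check below uses only standard C99 facilities and assumes the floating-point environment is accessible.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);
    double y = sin(INFINITY);
    printf("sin(Inf) is NaN: %d, invalid raised: %d\n",
           isnan(y) != 0, fetestexcept(FE_INVALID) != 0);  /* expect 1, 1 */
    return 0;
}
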
diff --git a/src/sinh.c b/src/sinh.c
new file mode 100644
index 0000000..f22fee4
--- /dev/null
+++ b/src/sinh.c
@@ -0,0 +1,371 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange(double x, int xneg)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"sinh";
+ if (_LIB_VERSION == _SVID_)
+ {
+ if (xneg)
+ exc.retval = -HUGE;
+ else
+ exc.retval = HUGE;
+ }
+ else
+ {
+ if (xneg)
+ exc.retval = -infinity_with_flags(AMD_F_OVERFLOW);
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+double FN_PROTOTYPE(sinh)(double x)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_sinh_arg:
+ sinh(x) = sign(x)*Inf
+
+ abs(x) >= small_threshold:
+ sinh(x) = sign(x)*exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0)))
+ sinh(x) is then sign(x)*z. */
+
+ static const double
+ max_sinh_arg = 7.10475860073943977113e+02, /* 0x408633ce8fb9f87e */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+    /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is negligible compared with exp(x) */
+
+ /* Lead and tail tabulated values of sinh(i) and cosh(i)
+ for i = 0,...,36. The lead part has 26 leading bits. */
+
+ static const double sinh_lead[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */
+ 3.62686038017272949219e+00, /* 0x400d03cf60000000 */
+ 1.00178747177124023438e+01, /* 0x40240926e0000000 */
+ 2.72899169921875000000e+01, /* 0x403b4a3800000000 */
+ 7.42032089233398437500e+01, /* 0x40528d0160000000 */
+ 2.01713153839111328125e+02, /* 0x406936d228000000 */
+ 5.48316116333007812500e+02, /* 0x4081228768000000 */
+ 1.49047882080078125000e+03, /* 0x409749ea50000000 */
+ 4.05154187011718750000e+03, /* 0x40afa71570000000 */
+ 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double sinh_tail[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */
+ 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */
+ 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */
+ 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */
+ 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */
+ 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */
+ 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */
+ 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */
+ 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */
+ 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */
+ 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */
+ 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */
+ 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */
+ 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */
+ 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */
+ 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */
+ 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */
+ 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */
+ 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */
+ 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */
+ 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */
+ 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */
+ 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */
+ 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */
+ 2.60692936262073658327e+02, /* 0x40704b1644557d1a */
+ 3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */
+ 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ static const double cosh_lead[37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */
+ 3.76219564676284790039e+00, /* 0x400e18fa08000000 */
+ 1.00676617622375488281e+01, /* 0x402422a490000000 */
+ 2.73082327842712402344e+01, /* 0x403b4ee858000000 */
+ 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */
+ 2.01715633392333984375e+02, /* 0x406936e678000000 */
+ 5.48317031860351562500e+02, /* 0x4081228948000000 */
+ 1.49047915649414062500e+03, /* 0x409749eaa8000000 */
+ 4.05154199218750000000e+03, /* 0x40afa71580000000 */
+ 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double cosh_tail[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */
+ 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */
+ 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */
+ 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */
+ 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */
+ 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */
+ 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */
+ 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */
+ 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */
+ 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */
+ 6.51685096227860253398e-05, /* 0x3f11156278615e10 */
+ 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */
+ 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */
+ 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */
+ 2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */
+ 1.02539925859688602072e-02, /* 0x3f85000b967b3698 */
+ 1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */
+ 6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */
+ 4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */
+ 1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */
+ 1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */
+ 7.06579578098005001152e+00, /* 0x401c435ff81e18ac */
+ 5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */
+ 1.68921736147088438429e+02, /* 0x40651d7edccde926 */
+ 2.60692936262087528121e+02, /* 0x40704b1644557e0e */
+ 3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */
+ 4.07689930834187453002e+03, /* 0x40afd9cc72249abe */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that sinh(x) = x */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return x;
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */
+ {
+ return x + x;
+ }
+
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_sinh_arg)
+ {
+ /* Return +/-infinity with overflow flag */
+
+#ifdef WINDOWS
+ if (xneg)
+ return handle_error("sinh", NINFBITPATT_DP64, _OVERFLOW,
+                          AMD_F_OVERFLOW, ERANGE, x, 0.0F);
+ else
+ return handle_error("sinh", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, x, 0.0F);
+#else
+ return retval_errno_erange(x, xneg);
+#endif
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so sinh(y) is approximated by sign(x)*exp(y)/2. The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ if (m >= EMIN_DP64 && m <= EMAX_DP64)
+ z = scaleDouble_1((z1+z2),m);
+ else
+ z = scaleDouble_2((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy, sdy1, sdy2;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+ sdy = dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ /* At this point sinh(dy) is approximated by dy + sdy.
+ Shift some significant bits from dy to sdy. */
+
+ GET_BITS_DP64(dy, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, sdy1);
+ sdy2 = sdy + (dy - sdy1);
+
+ z = ((((((cosh_tail[ind]*sdy2 + sinh_tail[ind]*cdy)
+ + cosh_tail[ind]*sdy1) + sinh_tail[ind])
+ + cosh_lead[ind]*sdy2) + sinh_lead[ind]*cdy)
+ + cosh_lead[ind]*sdy1) + sinh_lead[ind];
+ }
+
+ if (xneg) z = - z;
+ return z;
+}
+
+weak_alias (__sinh, sinh)
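
A small check of the table-splitting identity used by the mid-range branch above, sinh(y0 + dy) = sinh(y0)*cosh(dy) + cosh(y0)*sinh(dy), evaluated with the standard libm functions rather than the tabulated lead/tail values.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double y = 5.3;
    double y0 = floor(y);          /* integer part, the table index */
    double dy = y - y0;            /* increment in [0,1) */
    double split = sinh(y0) * cosh(dy) + cosh(y0) * sinh(dy);
    printf("direct: %.17g\nsplit : %.17g\n", sinh(y), split);
    return 0;
}
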
diff --git a/src/sinhf.c b/src/sinhf.c
new file mode 100644
index 0000000..eaad0fd
--- /dev/null
+++ b/src/sinhf.c
@@ -0,0 +1,292 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange(float x, int xneg)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"sinhf";
+ if (_LIB_VERSION == _SVID_)
+ {
+ if (xneg)
+ exc.retval = -HUGE;
+ else
+ exc.retval = HUGE;
+ }
+ else
+ {
+ if (xneg)
+ exc.retval = -infinity_with_flags(AMD_F_OVERFLOW);
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(sinhf)
+#endif
+
+float FN_PROTOTYPE(sinhf)(float fx)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_sinh_arg:
+ sinh(x) = sign(x)*Inf
+
+ abs(x) >= small_threshold:
+ sinh(x) = sign(x)*exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0)))
+ sinh(x) is then sign(x)*z. */
+
+ static const double
+ /* The max argument of sinhf, but stored as a double */
+ max_sinh_arg = 8.94159862922329438106e+01, /* 0x40565a9f84f82e63 */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+    /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is negligible compared with exp(x) */
+
+ /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36. */
+
+ static const double sinh_lead[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */
+ 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */
+ 1.00178749274099008204e+01, /* 0x40240926e70949ad */
+ 2.72899171971277496596e+01, /* 0x403b4a3803703630 */
+ 7.42032105777887522891e+01, /* 0x40528d0166f07374 */
+ 2.01713157370279219549e+02, /* 0x406936d22f67c805 */
+ 5.48316123273246489589e+02, /* 0x408122876ba380c9 */
+ 1.49047882578955000099e+03, /* 0x409749ea514eca65 */
+ 4.05154190208278987484e+03, /* 0x40afa7157430966f */
+ 1.10132328747033916443e+04, /* 0x40c5829dced69991 */
+ 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */
+ 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */
+ 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */
+ 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */
+ 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */
+ 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */
+ 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */
+ 3.28299845686652474105e+07, /* 0x417f4f22091940bb */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ static const double cosh_lead[37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */
+ 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */
+ 1.00676619957777653269e+01, /* 0x402422a497d6185e */
+ 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */
+ 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */
+ 2.01715636122455890700e+02, /* 0x406936e67db9b919 */
+ 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */
+ 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */
+ 4.05154202549259389343e+03, /* 0x40afa715845d8894 */
+ 1.10132329201033226127e+04, /* 0x40c5829dd053712d */
+ 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */
+ 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */
+ 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */
+ 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */
+ 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */
+ 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */
+ 1.20774763767876680940e+07, /* 0x416709348c0ea503 */
+ 3.28299845686652623117e+07, /* 0x417f4f22091940bf */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ unsigned long long ux, aux, xneg;
+ double x = fx, y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3f10000000000000) /* |x| small enough that sinh(x) = x */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return fx;
+ else
+ return valf_with_flags(fx, AMD_F_INEXACT);
+ }
+ else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */
+ {
+#ifdef WINDOWS
+ if (aux > 0x7ff0000000000000)
+ {
+ /* x is NaN */
+ unsigned int uhx;
+ GET_BITS_SP32(fx, uhx);
+ return handle_errorf("sinhf", uhx|0x00400000, _DOMAIN,
+ AMD_F_INVALID, EDOM, fx, 0.0F);
+ }
+ else
+#endif
+ return fx + fx;
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_sinh_arg)
+ {
+ /* Return infinity with overflow flag. */
+#ifdef WINDOWS
+ if (xneg)
+ return handle_errorf("sinhf", NINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, fx, 0.0F);
+ else
+ return handle_errorf("sinhf", PINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, fx, 0.0F);
+#else
+ /* This handles POSIX behaviour */
+ __set_errno(ERANGE);
+ z = infinity_with_flags(AMD_F_OVERFLOW);
+#endif
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so sinh(y) is approximated by sign(x)*exp(y)/2. The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+ /* scaleDouble_1 is always safe because the argument x was
+ float, rather than double */
+ z = scaleDouble_1((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+
+ sdy = dy + dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = 1 + dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ z = sinh_lead[ind]*cdy + cosh_lead[ind]*sdy;
+ }
+
+ if (xneg) z = - z;
+ return (float)z;
+}
+
+weak_alias (__sinhf, sinhf)
diff --git a/src/sqrt.c b/src/sqrt.c
new file mode 100644
index 0000000..14c5b1e
--- /dev/null
+++ b/src/sqrt.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrt)
+#endif
+/* SSE2 provides the instruction SQRTSD, which computes the square root
+   of the low-order double-precision floating-point value in an XMM register
+   or in a 64-bit memory location and writes the result to the low-order quadword
+   of another XMM register. The corresponding intrinsic is _mm_sqrt_sd(). */
+double FN_PROTOTYPE(sqrt)(double x)
+{
+ __m128d X128;
+ double result;
+ UT64 uresult;
+
+ if(x < 0.0)
+ {
+ uresult.u64 = 0xfff8000000000000;
+ __amd_handle_error(DOMAIN, EDOM, "sqrt", x, 0.0 , uresult.f64);
+ return uresult.f64;
+ }
+ /*Load x into an XMM register*/
+ X128 = _mm_load_sd(&x);
+    /*Calculate sqrt using the SQRTSD instruction*/
+ X128 = _mm_sqrt_sd(X128, X128);
+ /*Store back the result into a double precision floating point number*/
+ _mm_store_sd(&result, X128);
+ return result;
+}
+
+
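
A minimal sketch of the SQRTSD path above using the same SSE2 intrinsics, with the negative-argument error handling left out.

#include <emmintrin.h>
#include <stdio.h>

static double sqrt_sse2(double x)
{
    __m128d v = _mm_load_sd(&x);   /* load x into the low lane */
    v = _mm_sqrt_sd(v, v);         /* SQRTSD on the low lane */
    double r;
    _mm_store_sd(&r, v);
    return r;
}

int main(void)
{
    printf("%.15f\n", sqrt_sse2(2.0));   /* 1.414213562373095... */
    return 0;
}
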
diff --git a/src/sqrtf.c b/src/sqrtf.c
new file mode 100644
index 0000000..48e53cd
--- /dev/null
+++ b/src/sqrtf.c
@@ -0,0 +1,73 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrtf)
+#endif
+/*SSE2 provides the SQRTSS instruction, which computes the square root
+ of the low-order single-precision floating-point value in an XMM register
+ or in a 32-bit memory location and writes the result to the low-order doubleword
+ of another XMM register. The corresponding intrinsic is _mm_sqrt_ss().*/
+float FN_PROTOTYPE(sqrtf)(float x)
+{
+ __m128 X128;
+ float result;
+ UT32 uresult;
+
+ if(x < 0.0)
+ {
+ uresult.u32 = 0xffc00000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "sqrtf", x, is_x_snan, 0.0f, 0, uresult.f32, 0);
+ }
+
+ return uresult.f32;
+ }
+
+ /*Load x into an XMM register*/
+ X128 = _mm_load_ss(&x);
+ /*Calculate sqrt using the SQRTSS instruction*/
+ X128 = _mm_sqrt_ss(X128);
+ /*Store back the result into a single precision floating point number*/
+ _mm_store_ss(&result, X128);
+ return result;
+}
+
+
diff --git a/src/tan.c b/src/tan.c
new file mode 100644
index 0000000..a7fe651
--- /dev/null
+++ b/src/tan.c
@@ -0,0 +1,260 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+extern void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+
+/* tan(x + xx) approximation valid on the interval [-pi/4,pi/4].
+ If recip is true return -1/tan(x + xx) instead. */
+static inline double tan_piby4(double x, double xx, int recip)
+{
+ double r, t1, t2, xl;
+ int transform = 0;
+ static const double
+ piby4_lead = 7.85398163397448278999e-01, /* 0x3fe921fb54442d18 */
+ piby4_tail = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */
+
+ /* In order to maintain relative precision, transform using the identity
+ tan(pi/4 - x) = (1 - tan(x))/(1 + tan(x)) for arguments close to pi/4.
+ Similarly use tan(x - pi/4) = (tan(x) - 1)/(tan(x) + 1) close to -pi/4. */
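+ /* The first identity follows from the angle-difference formula:
+    tan(pi/4 - x) = (tan(pi/4) - tan(x))/(1 + tan(pi/4)*tan(x))
+                  = (1 - tan(x))/(1 + tan(x)),
+    since tan(pi/4) = 1; the second is obtained by negating both sides. */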
+
+ if (x > 0.68)
+ {
+ transform = 1;
+ x = piby4_lead - x;
+ xl = piby4_tail - xx;
+ x += xl;
+ xx = 0.0;
+ }
+ else if (x < -0.68)
+ {
+ transform = -1;
+ x = piby4_lead + x;
+ xl = piby4_tail + xx;
+ x += xl;
+ xx = 0.0;
+ }
+
+ /* Core Remez [2,3] approximation to tan(x+xx) on the
+ interval [0,0.68]. */
+
+ r = x*x + 2.0 * x * xx;
+ t1 = x;
+ t2 = xx + x*r*
+ (0.372379159759792203640806338901e0 +
+ (-0.229345080057565662883358588111e-1 +
+ 0.224044448537022097264602535574e-3*r)*r)/
+ (0.111713747927937668539901657944e1 +
+ (-0.515658515729031149329237816945e0 +
+ (0.260656620398645407524064091208e-1 -
+ 0.232371494088563558304549252913e-3*r)*r)*r);
+
+ /* Reconstruct tan(x) in the transformed case. */
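+ /* Here t ~= tan(pi/4 - |original argument|) and transform carries the sign:
+      tan(original)    = transform*(1 - t)/(1 + t) = transform*(1.0 - 2*t/(1 + t))
+      -1/tan(original) = transform*(1 + t)/(t - 1) = transform*(2*t/(t - 1) - 1.0) */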
+
+ if (transform)
+ {
+ double t;
+ t = t1 + t2;
+ if (recip)
+ return transform*(2*t/(t-1) - 1.0);
+ else
+ return transform*(1.0 - 2*t/(1+t));
+ }
+
+ if (recip)
+ {
+ /* Compute -1.0/(t1 + t2) accurately */
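+ /* Split t into a high part z1 (low 32 bits of the mantissa cleared) and
+    a tail z2 with z1 + z2 = t1 + t2, and let trec_top be a similarly
+    truncated copy of the first approximation trec ~= -1.0/t. Then
+      -1/t = trec_top + (-1/t)*(1 + trec_top*t)
+           ~= trec_top + trec*((1.0 + trec_top*z1) + trec_top*z2),
+    where trec_top*z1 is exact because both factors have short mantissas. */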
+ double trec, trec_top, z1, z2, t;
+ unsigned long long u;
+ t = t1 + t2;
+ GET_BITS_DP64(t, u);
+ u &= 0xffffffff00000000;
+ PUT_BITS_DP64(u, z1);
+ z2 = t2 - (z1 - t1);
+ trec = -1.0 / t;
+ GET_BITS_DP64(trec, u);
+ u &= 0xffffffff00000000;
+ PUT_BITS_DP64(u, trec_top);
+ return trec_top + trec * ((1.0 + trec_top * z1) + trec_top * z2);
+
+ }
+ else
+ return t1 + t2;
+}
+
+#ifdef WINDOWS
+#pragma function(tan)
+#endif
+
+double FN_PROTOTYPE(tan)(double x)
+{
+ double r, rr;
+ int region, xneg;
+
+ unsigned long long ux, ax;
+ GET_BITS_DP64(x, ux);
+ ax = (ux & ~SIGNBIT_DP64);
+ if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ {
+ if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ {
+ if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ {
+ if (ax == 0x0000000000000000) return x;
+ else return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Using a temporary variable prevents 64-bit VC++ from
+ rearranging
+ x + x*x*x*0.333333333333333333;
+ into
+ x * (1 + x*x*0.333333333333333333);
+ The latter results in an incorrectly rounded answer. */
+ double tmp;
+ tmp = x*x*x*0.333333333333333333;
+ return x + tmp;
+#else
+ return x + x*x*x*0.333333333333333333;
+#endif
+ }
+ }
+ else
+ return tan_piby4(x, 0.0, 0);
+ }
+ else if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("tan", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ /* x is infinity. Return a NaN */
+#ifdef WINDOWS
+ return handle_error("tan", INDEFBITPATT_DP64, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return nan_with_flags(AMD_F_INVALID);
+#endif
+ }
+ xneg = (ax != ux);
+
+
+ if (xneg)
+ x = -x;
+
+ if (x < 5.0e5)
+ {
+ /* For arguments of this size we can just carefully subtract the
+ appropriate multiple of pi/2, using extra precision where
+ x is close to an exact multiple of pi/2 */
+ static const double
+ twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */
+ piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */
+ piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */
+ piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */
+ piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */
+ piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */
+ piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */
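+ /* Cody-Waite style constants: piby2_1 + piby2_2 + piby2_3 ~= pi/2, with
+    each piby2_k truncated so that npi2*piby2_k is exact for the npi2
+    values that can occur here, and each _tail value holding the next
+    bits of pi/2 beyond the corresponding truncation. */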
+ double t, rhead, rtail;
+ int npi2;
+ unsigned long long uy, xexp, expdiff;
+ xexp = ax >> EXPSHIFTBITS_DP64;
+ /* How many pi/2 is x a multiple of? */
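+ /* For the smallest arguments, choose npi2 directly from comparisons
+    against the odd multiples 3pi/4 ... 9pi/4; for larger x round
+    x*(2/pi) to the nearest integer. */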
+ if (ax <= 0x400f6a7a2955385e) /* 5pi/4 */
+ {
+ if (ax <= 0x4002d97c7f3321d2) /* 3pi/4 */
+ npi2 = 1;
+ else
+ npi2 = 2;
+ }
+ else if (ax <= 0x401c463abeccb2bb) /* 9pi/4 */
+ {
+ if (ax <= 0x4015fdbbe9bba775) /* 7pi/4 */
+ npi2 = 3;
+ else
+ npi2 = 4;
+ }
+ else
+ npi2 = (int)(x * twobypi + 0.5);
+ /* Subtract the multiple from x to get an extra-precision remainder */
+ rhead = x - npi2 * piby2_1;
+ rtail = npi2 * piby2_1tail;
+ GET_BITS_DP64(rhead, uy);
+ expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ if (expdiff > 15)
+ {
+ /* The remainder is pretty small compared with x, which
+ implies that x is a near multiple of pi/2
+ (x matches the multiple to at least 15 bits) */
+ t = rhead;
+ rtail = npi2 * piby2_2;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ if (expdiff > 48)
+ {
+ /* x matches a pi/2 multiple to at least 48 bits */
+ t = rhead;
+ rtail = npi2 * piby2_3;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_3tail - ((t - rhead) - rtail);
+ }
+ }
+ r = rhead - rtail;
+ rr = (rhead - r) - rtail;
+ region = npi2 & 3;
+ }
+ else
+ {
+ /* Reduce x into range [-pi/4,pi/4] */
+ __amd_remainder_piby2(x, &r, &rr, &region);
+ /* __remainder_piby2(x, &r, &rr, &region);*/
+ }
+
+ if (xneg)
+ return -tan_piby4(r, rr, region & 1);
+ else
+ return tan_piby4(r, rr, region & 1);
+}
+
+weak_alias (__tan, tan)
diff --git a/src/tanf.c b/src/tanf.c
new file mode 100644
index 0000000..856cdcf
--- /dev/null
+++ b/src/tanf.c
@@ -0,0 +1,203 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+/*#define USE_REMAINDER_PIBY2F_INLINE*/
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NANF_WITH_FLAGS
+/*#undef USE_REMAINDER_PIBY2F_INLINE*/
+#undef USE_HANDLE_ERRORF
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+extern void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region);
+
+/* tan(x) approximation valid on the interval [-pi/4,pi/4].
+ If recip is true return -1/tan(x) instead. */
+static inline double tanf_piby4(double x, int recip)
+{
+ double r, t;
+
+ /* Core Remez [1,2] approximation to tan(x) on the
+ interval [0,pi/4]. */
+ r = x*x;
+ t = x + x*r*
+ (0.385296071263995406715129e0 -
+ 0.172032480471481694693109e-1 * r) /
+ (0.115588821434688393452299e+1 +
+ (-0.51396505478854532132342e0 +
+ 0.1844239256901656082986661e-1 * r) * r);
+
+ if (recip)
+ return -1.0 / t;
+ else
+ return t;
+}
+
+#ifdef WINDOWS
+#pragma function(tanf)
+#endif
+
+float FN_PROTOTYPE(tanf)(float x)
+{
+ double r, dx;
+ int region, xneg;
+
+ unsigned long long ux, ax;
+
+ dx = x;
+
+ GET_BITS_DP64(dx, ux);
+ ax = (ux & ~SIGNBIT_DP64);
+
+ if (ax <= 0x3fe921fb54442d18LL) /* abs(x) <= pi/4 */
+ {
+ if (ax < 0x3f80000000000000LL) /* abs(x) < 2.0^(-7) */
+ {
+ if (ax < 0x3f20000000000000LL) /* abs(x) < 2.0^(-13) */
+ {
+ if (ax == 0x0000000000000000LL)
+ return x;
+ else
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ else
+ return (float)(dx + dx*dx*dx*0.333333333333333333);
+ }
+ else
+ return (float)tanf_piby4(x, 0);
+ }
+ else if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ unsigned int ufx;
+ GET_BITS_SP32(x, ufx);
+ return handle_errorf("tanf", ufx|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return a NaN */
+#ifdef WINDOWS
+ return handle_errorf("tanf", INDEFBITPATT_SP32, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return nanf_with_flags(AMD_F_INVALID);
+#endif
+ }
+ }
+
+ xneg = (int)(ux >> 63);
+
+ if (xneg)
+ dx = -dx;
+
+ if (dx < 5.0e5)
+ {
+ /* For arguments of this size we can just carefully subtract the
+ appropriate multiple of pi/2, using extra precision where
+ dx is close to an exact multiple of pi/2 */
+ static const double
+ twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */
+ piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */
+ piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */
+ piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */
+ piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */
+ piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */
+ piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */
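+ /* Same split of pi/2 into exactly-representable pieces plus tails as in tan.c. */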
+ double t, rhead, rtail;
+ int npi2;
+ unsigned long long uy, xexp, expdiff;
+ xexp = ax >> EXPSHIFTBITS_DP64;
+ /* How many pi/2 is dx a multiple of? */
+ if (ax <= 0x400f6a7a2955385eLL) /* 5pi/4 */
+ {
+ if (ax <= 0x4002d97c7f3321d2LL) /* 3pi/4 */
+ npi2 = 1;
+ else
+ npi2 = 2;
+ }
+ else if (ax <= 0x401c463abeccb2bbLL) /* 9pi/4 */
+ {
+ if (ax <= 0x4015fdbbe9bba775LL) /* 7pi/4 */
+ npi2 = 3;
+ else
+ npi2 = 4;
+ }
+ else
+ npi2 = (int)(dx * twobypi + 0.5);
+ /* Subtract the multiple from dx to get an extra-precision remainder */
+ rhead = dx - npi2 * piby2_1;
+ rtail = npi2 * piby2_1tail;
+ GET_BITS_DP64(rhead, uy);
+ expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ if (expdiff > 15)
+ {
+ /* The remainder is pretty small compared with dx, which
+ implies that dx is a near multiple of pi/2
+ (dx matches the multiple to at least 15 bits) */
+ t = rhead;
+ rtail = npi2 * piby2_2;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ if (expdiff > 48)
+ {
+ /* dx matches a pi/2 multiple to at least 48 bits */
+ t = rhead;
+ rtail = npi2 * piby2_3;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_3tail - ((t - rhead) - rtail);
+ }
+ }
+ r = rhead - rtail;
+ region = npi2 & 3;
+ }
+ else
+ {
+ /* Reduce x into range [-pi/4,pi/4] */
+ __amd_remainder_piby2d2f(ax, &r, &region);
+ }
+
+ if (xneg)
+ return (float)-tanf_piby4(r, region & 1);
+ else
+ return (float)tanf_piby4(r, region & 1);
+}
+
+weak_alias (__tanf, tanf)
diff --git a/src/tanh.c b/src/tanh.c
new file mode 100644
index 0000000..ead758b
--- /dev/null
+++ b/src/tanh.c
@@ -0,0 +1,129 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_2
+#define USE_VAL_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_2
+#undef USE_VAL_WITH_FLAGS
+
+double FN_PROTOTYPE(tanh)(double x)
+{
+ /*
+ The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent
+ to the following three formulae:
+ 1. (exp(x) - exp(-x))/(exp(x) + exp(-x))
+ 2. (1 - (2/(exp(2*x) + 1 )))
+ 3. (exp(2*x) - 1)/(exp(2*x) + 1)
+ but computationally, some formulae are better on some ranges.
+ */
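+ /* Below, tanh(x) = x is used for very small |x|, rational (Remez)
+    approximations for |x| <= 1, formula 2 for 1 < |x| <= 20 (the
+    large_threshold), and +-1 for larger |x|, where exp(-2*|x|) is
+    negligible. */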
+ static const double
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ large_threshold = 20.0; /* 0x4034000000000000 */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, p, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that tanh(x) = x */
+ {
+ if (aux == 0)
+ return x; /* with no inexact */
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x7ff0000000000000) /* |x| is NaN */
+ return x + x;
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y > large_threshold)
+ {
+ /* If x is large then exp(-x) is negligible and
+ formula 1 reduces to plus or minus 1.0 */
+ z = 1.0;
+ }
+ else if (y <= 1.0)
+ {
+ double y2;
+ y2 = y*y;
+ if (y < 0.9)
+ {
+ /* Use a [3,3] Remez approximation on [0,0.9]. */
+ z = y + y*y2*
+ (-0.274030424656179760118928e0 +
+ (-0.176016349003044679402273e-1 +
+ (-0.200047621071909498730453e-3 -
+ 0.142077926378834722618091e-7*y2)*y2)*y2)/
+ (0.822091273968539282568011e0 +
+ (0.381641414288328849317962e0 +
+ (0.201562166026937652780575e-1 +
+ 0.2091140262529164482568557e-3*y2)*y2)*y2);
+ }
+ else
+ {
+ /* Use a [3,3] Remez approximation on [0.9,1]. */
+ z = y + y*y2*
+ (-0.227793870659088295252442e0 +
+ (-0.146173047288731678404066e-1 +
+ (-0.165597043903549960486816e-3 -
+ 0.115475878996143396378318e-7*y2)*y2)*y2)/
+ (0.683381611977295894959554e0 +
+ (0.317204558977294374244770e0 +
+ (0.167358775461896562588695e-1 +
+ 0.173076050126225961768710e-3*y2)*y2)*y2);
+ }
+ }
+ else
+ {
+ /* Compute p = exp(2*y) + 1. The code is basically inlined
+ from exp_amd. */
+
+ splitexp(2*y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ p = scaleDouble_2(z1 + z2, m) + 1.0;
+
+ /* Now reconstruct tanh from p. */
+ z = (1.0 - 2.0/p);
+ }
+
+ if (xneg) z = - z;
+ return z;
+}
+
+weak_alias (__tanh, tanh)
diff --git a/src/tanhf.c b/src/tanhf.c
new file mode 100644
index 0000000..1cb14c4
--- /dev/null
+++ b/src/tanhf.c
@@ -0,0 +1,126 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+
+#define USE_SPLITEXPF
+#define USE_SCALEFLOAT_2
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXPF
+#undef USE_SCALEFLOAT_2
+#undef USE_VALF_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+float FN_PROTOTYPE(tanhf)(float x)
+{
+ /*
+ The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent
+ to the following three formulae:
+ 1. (exp(x) - exp(-x))/(exp(x) + exp(-x))
+ 2. (1 - (2/(exp(2*x) + 1 )))
+ 3. (exp(2*x) - 1)/(exp(2*x) + 1)
+ but computationally, some formulae are better on some ranges.
+ */
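+ /* Below, tanhf(x) = x is used for very small |x|, rational (Remez)
+    approximations for |x| <= 1, formula 2 for 1 < |x| <= 10 (the
+    large_threshold), and +-1 for larger |x|. */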
+ static const float
+ thirtytwo_by_log2 = 4.6166240692e+01F, /* 0x4238aa3b */
+ log2_by_32_lead = 2.1659851074e-02F, /* 0x3cb17000 */
+ log2_by_32_tail = 9.9831822808e-07F, /* 0x3585fdf4 */
+ large_threshold = 10.0F; /* 0x41200000 */
+
+ unsigned int ux, aux;
+ float y, z, p, z1, z2, xneg;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ if (aux < 0x39000000) /* |x| small enough that tanh(x) = x */
+ {
+ if (aux == 0)
+ return x; /* with no inexact */
+ else
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x7f800000) /* |x| is NaN */
+ return x + x;
+
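+ /* aux != ux exactly when the sign bit of x is set, so xneg is
+    -1.0F for negative x and +1.0F otherwise. */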
+ xneg = 1.0F - 2.0F * (aux != ux);
+
+ y = xneg * x;
+
+ if (y > large_threshold)
+ {
+ /* If x is large then exp(-x) is negligible and
+ formula 1 reduces to plus or minus 1.0 */
+ z = 1.0F;
+ }
+ else if (y <= 1.0F)
+ {
+ float y2;
+ y2 = y*y;
+
+ if (y < 0.9F)
+ {
+ /* Use a [2,1] Remez approximation on [0,0.9]. */
+ z = y + y*y2*
+ (-0.28192806108402678e0F +
+ (-0.14628356048797849e-2F +
+ 0.4891631088530669873e-4F*y2)*y2)/
+ (0.845784192581041099e0F +
+ 0.3427017942262751343e0F*y2);
+ }
+ else
+ {
+ /* Use a [2,1] Remez approximation on [0.9,1]. */
+ z = y + y*y2*
+ (-0.24069858695196524e0F +
+ (-0.12325644183611929e-2F +
+ 0.3827534993599483396e-4F*y2)*y2)/
+ (0.72209738473684982e0F +
+ 0.292529068698052819e0F*y2);
+ }
+ }
+ else
+ {
+ /* Compute p = exp(2*y) + 1. The code is basically inlined
+ from exp_amd. */
+
+ splitexpf(2*y, 1.0F, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ p = scaleFloat_2(z1 + z2, m) + 1.0F;
+ /* Now reconstruct tanh from p. */
+ z = (1.0F - 2.0F/p);
+ }
+
+ return xneg * z;
+}
+
+
+weak_alias (__tanhf, tanhf)
diff --git a/testdata/exp.rephil_docs.builtin.baseline.trace b/testdata/exp.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..8344f12
--- /dev/null
+++ b/testdata/exp.rephil_docs.builtin.baseline.trace
Binary files differ
diff --git a/testdata/expf.fastmath_unittest.trace b/testdata/expf.fastmath_unittest.trace
new file mode 100644
index 0000000..c867b36
--- /dev/null
+++ b/testdata/expf.fastmath_unittest.trace
Binary files differ
diff --git a/testdata/log.rephil_docs.builtin.baseline.trace b/testdata/log.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..e87d631
--- /dev/null
+++ b/testdata/log.rephil_docs.builtin.baseline.trace
Binary files differ
diff --git a/testdata/notes.txt b/testdata/notes.txt
new file mode 100644
index 0000000..8b5884f
--- /dev/null
+++ b/testdata/notes.txt
@@ -0,0 +1,23 @@
+The traces in this directory are used for validating the math library
+and for testing its performance. Each file contains the input
+arguments to the named math function, written in raw binary format.
+
+exp, log and pow were collected from the Perflab benchmark
+compiler/rephil/docs/v7, and expf was collected from
+util/math:fastmath_unittest.
+
+The traces were collected by linking in a small library that wrote
+the first 4M arguments to a file before returning the actual value
+(a minimal sketch of such a wrapper appears at the end of this note).
+ - The library was added as a dep to "base:base".
+ - To avoid writing samples for genrules, the profiling was guarded by
+   a macro that was defined using --copt.
+ - Tcmalloc holds a lock while it calls log(), so care had to be taken
+   not to cause a deadlock when profiling log().
+   For the other functions, the actual value could be calculated
+   using something like this:
+     _exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp");
+     return _exp(x);
+   For log(), we made the following call instead:
+     return log10(x)/log10(2.71828182846);
+
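+A minimal sketch of such a wrapper, for exp(), is shown below. The file
+name, sample limit and overall structure are illustrative only; this is
+not the actual profiling library described above.
+
+    #define _GNU_SOURCE
+    #include <dlfcn.h>
+    #include <stdio.h>
+
+    #define TRACE_SAMPLES (4*1024*1024)
+
+    /* Interpose exp(): record the argument, then forward to the real exp(). */
+    double exp(double x)
+    {
+      static double (*real_exp)(double);
+      static FILE *trace;
+      static long long count;
+
+      if (real_exp == 0)
+        real_exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp");
+      if (trace == 0)
+        trace = fopen("/tmp/exp.trace", "wb");   /* illustrative path */
+      if (trace != 0 && count < TRACE_SAMPLES)
+      {
+        fwrite(&x, sizeof(x), 1, trace);         /* raw binary double */
+        count++;
+      }
+      return real_exp(x);
+    }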
diff --git a/testdata/pow.rephil_docs.builtin.baseline.trace b/testdata/pow.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..b7a9722
--- /dev/null
+++ b/testdata/pow.rephil_docs.builtin.baseline.trace
Binary files differ