[Dev] NeoScrypt GPU Miner - Public Beta Test

SS2006

I assume this is the newest version: http://cryptomining-blog.com/3715-new-cgminer-3-7-8-with-improved-neoscrypt-performance/

Unfortunately it took a step backwards for NVIDIA cards (slower), but it's meant for AMD, so if it is better there, then good!

The download link that this blog provides is hosted by itself… No sigs etc…

Not 100% sure if I would trust that download link.

Wolf0 has a nice windows compile right here signed and all.

Okay, done. I’m pretty sure it works, but haven’t tested on a Windows installation. This is a zip of my kernel, slightly modified for SGMiner, as well as all the other kernels included on the github’s develop branch, and a Win64 binary. Static compile, no DLLs, just like my standard SGMiner builds on Litecointalk. Also GPG signed, like my standard builds. Someone please test for me and ensure it works.

https://ottrbutt.com/sgminer/neoscrypt/sgminer5-neoscrypt-11-02-2014.zip

And of course, GPG sigs for those that check them (you should be): https://ottrbutt.com/sgminer/neoscrypt/sgminer5-neoscrypt-11-02-2014.zip.sig
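For anyone who has not checked a detached signature before, here is a minimal sketch of the check. It assumes `gpg` is installed, that the signer's public key has already been imported, and that the zip and its `.sig` were downloaded into the same directory. The helper names are made up for illustration.

```python
import subprocess
from typing import List, Optional


def gpg_verify_command(archive: str, signature: Optional[str] = None) -> List[str]:
    """Build the gpg command line for checking a detached signature.

    By default the signature is assumed to sit next to the archive
    with ".sig" appended, as in the links above.
    """
    sig = signature if signature is not None else archive + ".sig"
    return ["gpg", "--verify", sig, archive]


def signature_is_good(archive: str) -> bool:
    """Run gpg and report whether it accepted the signature.

    gpg exits with status 0 only when the signature verifies against
    a key in the local keyring.
    """
    result = subprocess.run(gpg_verify_command(archive),
                            capture_output=True, text=True)
    return result.returncode == 0
```

A good signature only proves the file was signed by whoever holds that key, so the key itself still has to come from a source you trust.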

ghostlander

Congrats - not a bad chacha for 6xxx. Your salsa though… needs work.

It’s scalar now. I’m not impressed by one found in old Scrypt kernels. Can write a better one probably. It isn’t a bottleneck anyway for these cards. When I replaced ChaCha with this one, it went from something like 12KH/s to 14KH/s. Loop unrolling delivered more alone.

BTW, it runs reasonably well with old AMD drivers and OpenCL compilers. HD5870 on Windows XP with 12.4 drivers went from 2.5KH/s to 10KH/s. Hell yeah, a 4x increase. Don’t try to use this kernel for NVIDIA. It fails to compile the vectorised ChaCha code.

How it use? Which miner?

Rename and put into any miner with the NeoScrypt support. Tested on cgminer v3.7.7, works fine.

Alpha Wolf

For those with old Radeon cards. This is my current OpenCL kernel: neoscrypt_vliw.cl

It is optimised to some extent for VLIW4/VLIW5. I get 17.5KH/s with it on a HD6970. That’s not much, but still better than 6KH/s with the default kernel.

Working great here for me with a couple of 6950s unlocked to 6970. Went from 5kh/s to 16.5kh/s using only -I 12 -w 64 -g 2. It also works with 3.7.7c and 3.7.8. All I did was back up neoscrypt140909.cl, delete it and rename your file to neoscrypt140909.cl to replace it, then back up and delete all the .bin files, letting it make new .bins, and bam, I was off to the races. :)
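The swap described above can be sketched as a small script. This is only an illustration: it assumes the kernel file and the cached .bin files all sit in one miner directory, and the `swap_kernel` helper and the `kernel_backup` folder name are made up here.

```python
import shutil
from pathlib import Path


def swap_kernel(miner_dir, new_kernel, stock_name="neoscrypt140909.cl",
                backup_name="kernel_backup"):
    """Replace the stock kernel with new_kernel and clear cached binaries.

    The stock kernel and every *.bin file are moved into a backup folder
    (not deleted), so the change can be undone by moving them back.
    Returns the names of the cached binaries that were moved away.
    """
    miner_dir = Path(miner_dir)
    backup_dir = miner_dir / backup_name
    backup_dir.mkdir(exist_ok=True)

    stock = miner_dir / stock_name
    if stock.exists():
        shutil.move(str(stock), str(backup_dir / stock_name))

    # The miner looks the kernel up by file name, so the new file
    # has to take over the stock name.
    shutil.copyfile(str(new_kernel), str(stock))

    moved = []
    for cached in sorted(miner_dir.glob("*.bin")):
        shutil.move(str(cached), str(backup_dir / cached.name))
        moved.append(cached.name)
    return moved
```

With the old .bin files out of the way, the miner has to rebuild the kernel from the new source on its next start.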

Thank you!

Now I don’t mind running them; before, I wouldn’t even use them to mine with.

Catalyst Version 13.12

{
"pools" : [
	{
		"url" : "http://us.mine-ftc.co.uk:19327",
		"user" : "xxxxxxxxxxxxy",
		"pass" : "x"
	}
]
,
"intensity" : "12,12",
"vectors" : "1,1",
"worksize" : "64,64",
"gpu-engine" : "825-825,825-825",
"gpu-fan" : "0-95,0-80",
"gpu-memclock" : "1300,1300",
"gpu-memdiff" : "0,0",
"gpu-powertune" : "0,0",
"gpu-vddc" : "0.000,0.000",
"temp-cutoff" : "90,90",
"temp-overheat" : "85,85",
"temp-target" : "70,70",
"api-mcast-port" : "4028",
"api-port" : "4028",
"expiry" : "1",
"failover-only" : true,
"gpu-dyninterval" : "7",
"gpu-platform" : "0",
"gpu-threads" : "2",
"log" : "5",
"neoscrypt" : true,
"no-pool-disable" : true,
"no-submit-stale" : true,
"queue" : "0",
"scan-time" : "1",
"temp-hysteresis" : "3",
"shares" : "0",
"kernel-path" : "/usr/local/bin",
"device" : "0-1"
}

MrBeen

Am I right that I cannot do anything with my GeForce 7600GS?

cisahasa

I think there is an error in neoscrypt_vliw.cl…

/* NeoScrypt core engine:
* N = 128, r = 2, p = 1, salt = password */
__attribute__((reqd_work_group_size(WORKGROUPSIZE, 1, 1)))

Should it be this instead?

/* NeoScrypt core engine:
* N = 128, r = 2, p = 1, salt = password */
__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))

ghostlander

There is no error, but it doesn’t really matter.

[2014-11-07 18:47:02] Started cgminer 3.7.8
[2014-11-07 18:47:07] Probing for an alive pool
[2014-11-07 18:47:08] Error -11: Building Program (clBuildProgram)
[2014-11-07 18:47:08] "/tmp/OCLxlhCZF.cl", line 665: error: identifier "WORKSIZE" is undefined
__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
^

1 error detected in the compilation of "/tmp/OCLxlhCZF.cl".

Internal error: clc comp

Wolf0

WORKSIZE is only for the newer SGMiner.
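To move one kernel file between miners that define different macros, the identifier can be rewritten before the file is dropped in. This is a rough sketch: the `retarget_worksize_macro` helper is hypothetical, and which miner defines which name is taken only from this thread (the cgminer 3.7.8 log shows WORKSIZE undefined, and newer SGMiner is said to use WORKSIZE).

```python
import re

# The two spellings seen in this thread for the work-group size macro.
KNOWN_MACROS = ("WORKSIZE", "WORKGROUPSIZE")


def retarget_worksize_macro(kernel_source: str, wanted: str) -> str:
    """Rewrite every work-group size macro in the source to `wanted`.

    Word boundaries keep the pattern from touching longer identifiers
    that merely contain one of the names.
    """
    if wanted not in KNOWN_MACROS:
        raise ValueError("wanted must be WORKSIZE or WORKGROUPSIZE")
    pattern = r"\b(?:" + "|".join(KNOWN_MACROS) + r")\b"
    return re.sub(pattern, wanted, kernel_source)
```

Going by the posts above, a kernel for cgminer 3.7.x would be retargeted to WORKGROUPSIZE and one for newer SGMiner to WORKSIZE.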

cisahasa

That’s what I used…

Wolf0

It’s scalar now. I’m not impressed by one found in old Scrypt kernels. Can write a better one probably. It isn’t a bottleneck anyway for these cards. When I replaced ChaCha with this one, it went from something like 12KH/s to 14KH/s. Loop unrolling delivered more alone.

BTW, it runs reasonably good with old AMD drivers and OpenCL compilers. HD5870 on Windows XP with 12.4 drivers went from 2.5KH/s to 10KH/s. Hell yeah, a 4x increase. Don’t try to use this kernel for NVIDIA. It fails to compile the vectorised ChaCha code.

Rename and put into any miner with the NeoScrypt support. Tested on cgminer v3.7.7, works fine.

I’ve never worked on 6xxx, but isn’t shuffle cheap? One shuffle for the permutation, keep it through all the salsa rounds + XOR ops, one shuffle to fix it. Seems like it’d be worth it - of course, unrolling will deliver a lot, probably.

Wolf0

It seems my 280X and 290X do like parallel chacha - I just needed to tweak it a bit more. Code size seems about the same, though, small speedup on execution time, I think.

Wolf0

OH MY GOD. I’ve been staring at this code for ages, and it only JUST NOW occurred to me that SMix() is parallelizable. Not the internals of SMix, of course, but the two calls to it…

cisahasa

I really hope, Wolf, that you are not doing this just for yourself…

You would gain more if you released your work. I’m sure people here would be willing to collect a bounty for you to release your latest kernels.

These people here are fair people.

Wolf0

I’m doing this because it’s interesting. Also, SMix being parallelizable hardly matters unless you split it into 3 kernels, which is doable, but idk what the overhead on the kernel launches would be…
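The independence of the two SMix() calls can be shown with a toy model. This sketch assumes the double-mix layout in which one copy of the block goes through a ChaCha-based mix, another through a Salsa-based mix, and the results are XORed. The `toy_smix` function is only a stand-in built on BLAKE2s hashing, not the real SMix, and host threads stand in for separate kernels.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def toy_smix(block: bytes, label: bytes, rounds: int = 128) -> bytes:
    """Stand-in for SMix: a sequential chain that cannot be split internally."""
    state = block
    for _ in range(rounds):
        state = hashlib.blake2s(label + state).digest()
    return state


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def double_mix_sequential(x: bytes) -> bytes:
    """Run both mixes one after the other, then combine."""
    return xor_bytes(toy_smix(x, b"chacha"), toy_smix(x, b"salsa"))


def double_mix_parallel(x: bytes) -> bytes:
    """Run both mixes at once: neither reads the other's output."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        mixed_chacha = pool.submit(toy_smix, x, b"chacha")
        mixed_salsa = pool.submit(toy_smix, x, b"salsa")
        # The only synchronisation point is the final XOR.
        return xor_bytes(mixed_chacha.result(), mixed_salsa.result())
```

Both functions return the same bytes for the same input. On a GPU the same shape would mean one kernel before the mixes, the two mixes, and one after, which is where the launch overhead question comes from.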

T4rQu1N

Quick question, in my bat file, how do I specify different values for say -i or -w so that my two cards (which are different) have different settings?

-i 14,15?

-w 48, 72?

Kind regards,

T4

einkerl

"intensity" : "18,18,18",
"worksize" : "256,128,256",

specify in your .conf
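For rigs with mixed cards, the comma-separated strings can be generated rather than typed by hand. A minimal sketch, where the `per_gpu_conf` helper is hypothetical and the values are the ones from the post above; each list holds one entry per card, in device order.

```python
import json


def per_gpu_conf(settings):
    """Turn {"option": [value per card, ...]} into .conf style strings.

    Every option must list the same number of cards, otherwise the
    miner would pair values with the wrong devices.
    """
    counts = {len(values) for values in settings.values()}
    if len(counts) > 1:
        raise ValueError("every option needs exactly one value per card")
    return {name: ",".join(str(v) for v in values)
            for name, values in settings.items()}


fragment = per_gpu_conf({
    "intensity": [18, 18, 18],
    "worksize": [256, 128, 256],
})
print(json.dumps(fragment, indent=2))
```

The same comma-separated form should also work on the command line in a .bat file, for example -I 14,15, with no space after the comma.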

ghostlander

neoscrypt_vliw.cl v2

It’s 19.5KH/s now on a HD6970. FastKDF and BLAKE2s have been cleaned up and optimised, memory requirements reduced.

OH MY GOD. I’ve been staring at this code for ages, and it only JUST NOW occurred to me that SMix() is parallelizable. Not the internals of SMix, of course, but the two calls to it…

Yeah, I’ve mentioned this in my white paper. Not sure if it’s of any use for mining.

I’ve never worked on 6xxx, but isn’t shuffle cheap? One shuffle for the permutation, keep it through all the salsa rounds + XOR ops, one shuffle to fix it. Seems like it’d be worth it - of course, unrolling will deliver a lot, probably.

It is, but that’s not what concerns me now. With FastKDF removed, the kernel gets reduced in size by ~60% and outputs 30KH/s. That’s a big overhead, but not critical and I’ve expected more out of ChaCha + Salsa. With ChaCha only enabled, it’s 58KH/s and with Salsa only = 56KH/s. Scalar Salsa isn’t supposed to be about as fast as vectorised ChaCha. It’s clearly scalar because the AMD compiler isn’t really smart and the kernel size is about double the ChaCha-only size. Anyway, there is a huge bottleneck somewhere and it needs to be identified.
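The hash rates quoted above can be turned into rough time budgets. This is back-of-the-envelope arithmetic only. It assumes all four figures come from the same HD6970 and that the stages run strictly one after another, so their times add. Neither assumption is stated in the post.

```python
def us_per_hash(khs: float) -> float:
    """Microseconds of GPU time per hash at a rate given in KH/s."""
    return 1000.0 / khs


# Figures quoted in the post above, all in KH/s.
FULL_KERNEL = 19.5   # neoscrypt_vliw.cl v2
NO_FASTKDF = 30.0    # FastKDF removed
CHACHA_ONLY = 58.0
SALSA_ONLY = 56.0

# If the ChaCha and Salsa passes simply ran back to back, their times
# would add, and this is the combined rate that would result.
serial_mix_khs = 1000.0 / (us_per_hash(CHACHA_ONLY) + us_per_hash(SALSA_ONLY))

# Share of the full kernel's time that vanishes when FastKDF is removed.
fastkdf_share = 1.0 - us_per_hash(NO_FASTKDF) / us_per_hash(FULL_KERNEL)

print(f"serial ChaCha + Salsa estimate: {serial_mix_khs:.1f} KH/s")
print(f"time attributed to FastKDF: {fastkdf_share:.0%}")
```

Under these assumptions the two mixes together would land near 28.5KH/s, close to the measured 30KH/s, and FastKDF would account for about 35% of the full kernel's time.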

Wolf0

I have a really hard time reading your style, but the code is pretty good! Don’t you think that bottleneck is waiting for global memory, though?

ghostlander

My first guess is that it runs out of private memory. It takes 512 bytes for block mixing + 800 bytes for FastKDF and BLAKE2s per kernel instance. That’s not including local variables, counters, etc. Scrypt consumes 3 times less private memory. It’s the opposite for global memory requirements, so you are not going to exceed them. Although the GCN cards report about the same amounts of local and constant memory (32KB + 64KB), they also have 32KB of L1 cache which may help. Maybe they also have more private space (registers). Global memory is used for V space only. Not much activity there. Everything else runs in private/local space.

Another guess is that there is something wrong with the miner itself, related to scheduling of kernel threads. Increase intensity over 13 and the hash rate drops. Increase it even more and you see HW errors. Set it to 20 and it hangs. Scrypt can do 20, but it’s different. I probably need to start with a clean fork and add the NeoScrypt support myself. I have a few other ideas, but they also need work.
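The private memory figures in the post above add up as follows. The byte counts and the 32KB local memory size are taken from the post. The work size of 64 comes from the settings posted earlier in the thread, and 4 bytes per register slot is an assumption made here for illustration.

```python
BLOCK_MIX_BYTES = 512        # block mixing state per kernel instance
KDF_BYTES = 800              # FastKDF + BLAKE2s state per kernel instance
LOCAL_MEMORY_BYTES = 32 * 1024

per_instance = BLOCK_MIX_BYTES + KDF_BYTES

# Assumed: private data is held in 32-bit (4-byte) register slots.
register_slots = per_instance // 4


def per_work_group(worksize: int) -> int:
    """Private bytes needed by one work-group of `worksize` instances."""
    return per_instance * worksize


# "3 times less" for Scrypt, per the post.
scrypt_estimate = per_instance / 3

print(f"{per_instance} bytes per instance ({register_slots} 32-bit slots)")
print(f"{per_work_group(64)} bytes for a work-group of 64")
print(f"Scrypt would need roughly {scrypt_estimate:.0f} bytes per instance")
print("fits in local memory:", per_work_group(64) <= LOCAL_MEMORY_BYTES)
```

That is 1312 bytes per instance before counters and temporaries. A work-group of 64 needs about 82KB, well beyond 32KB of local memory, so this state has to fit in registers or spill somewhere slower.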

Wolf0

I feel stupid. For some reason, I was thinking of GCN cards while talking about 6xxx. Oops.