[Dev] NeoScrypt GPU Miner - Public Beta Test

Wolf0

It’s scalar now. I’m not impressed by the one found in old Scrypt kernels. I can probably write a better one. It isn’t a bottleneck for these cards anyway. When I replaced ChaCha with this one, it went from something like 12KH/s to 14KH/s. Loop unrolling alone delivered more.

BTW, it runs reasonably well with old AMD drivers and OpenCL compilers. An HD5870 on Windows XP with 12.4 drivers went from 2.5KH/s to 10KH/s. Hell yeah, a 4x increase. Don’t try to use this kernel on NVIDIA. The vectorised ChaCha code fails to compile there.

Rename it and put it into any miner with NeoScrypt support. Tested on cgminer v3.7.7, works fine.

I’ve never worked on 6xxx, but isn’t shuffle cheap? One shuffle for the permutation, keep it through all the salsa rounds + XOR ops, one shuffle to fix it. Seems like it’d be worth it - of course, unrolling will deliver a lot, probably.
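
A rough Python sketch of the shuffle idea, not code from any kernel in this thread: the 16-word Salsa state is permuted once into four diagonal "vectors", all rounds run on that layout, and one inverse permutation restores the usual order. The function names and the `PERM` table are made up for illustration, and only the rounds are shown (no feed-forward addition). The lane rotations inside the loop are what a GPU would express as vector swizzles.

```python
MASK = 0xFFFFFFFF

def rotl(v, n):
    """Rotate a 32-bit word left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK

def qr(a, b, c, d):
    """One Salsa20 quarter-round on four 32-bit words."""
    b ^= rotl((a + d) & MASK, 7)
    c ^= rotl((b + a) & MASK, 9)
    d ^= rotl((c + b) & MASK, 13)
    a ^= rotl((d + c) & MASK, 18)
    return a, b, c, d

# Index groups for the column round, then the row round (scalar reference).
SCALAR_QRS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11),
              (0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))

def salsa_rounds_scalar(x, rounds=20):
    """Reference: Salsa rounds on the plain row-major state."""
    x = list(x)
    for _ in range(rounds // 2):
        for i0, i1, i2, i3 in SCALAR_QRS:
            x[i0], x[i1], x[i2], x[i3] = qr(x[i0], x[i1], x[i2], x[i3])
    return x

# Hypothetical permutation table: gathers the four diagonals A, B, C, D.
PERM = (0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11)

def qr4(A, B, C, D):
    """Apply the quarter-round lane-wise to four 4-word vectors."""
    lanes = [qr(a, b, c, d) for a, b, c, d in zip(A, B, C, D)]
    return tuple(list(col) for col in zip(*lanes))

def rot(v, n):
    """Rotate the lanes of a 4-word vector left by n positions."""
    return v[n:] + v[:n]

def salsa_rounds_shuffled(x, rounds=20):
    """Same rounds, but on a state permuted once up front."""
    p = [x[i] for i in PERM]                      # shuffle in
    A, B, C, D = p[0:4], p[4:8], p[8:12], p[12:16]
    for _ in range(rounds // 2):
        A, B, C, D = qr4(A, B, C, D)              # column round
        A, Dr, Cr, Br = qr4(A, rot(D, 1), rot(C, 2), rot(B, 3))  # row round
        D, C, B = rot(Dr, 3), rot(Cr, 2), rot(Br, 1)             # undo lane rotations
    p = A + B + C + D
    out = [0] * 16
    for k, i in enumerate(PERM):                  # shuffle out
        out[i] = p[k]
    return out
```

Both functions should return the same 16 words for the same input, which is an easy way to check a vectorised rewrite against the scalar one.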

Wolf0

It seems my 280X and 290X do like parallel chacha - I just needed to tweak it a bit more. Code size seems about the same, though, small speedup on execution time, I think.

Wolf0

OH MY GOD. I’ve been staring at this code for ages, and it only JUST NOW occurred to me that SMix() is parallelizable. Not the internals of SMix, of course, but the two calls to it…

cisahasa

I really hope, Wolf, you are not doing this just for yourself…

You would gain more if you released your work. I’m sure people here would be glad to collect a bounty for you to release your latest kernels.

The people here are fair people.

Wolf0

I’m doing this because it’s interesting. Also, SMix being parallelizable hardly matters unless you split it into 3 kernels, which is doable, but idk what the overhead on the kernel launches would be…
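
To make the point about the two SMix() calls concrete, here is a toy Python sketch. `toy_smix` is a made-up stand-in, not the real SMix, and combining the two results with XOR is an assumption for illustration. The only thing it shows is that neither call reads the other's output, so they can be issued concurrently and joined afterwards (which on a GPU would mean separate kernels plus a combine step, with the launch overhead mentioned above).

```python
from concurrent.futures import ThreadPoolExecutor

def toy_smix(block: bytes, tweak: int) -> bytes:
    """Hypothetical stand-in for one SMix() pass; NOT the real algorithm."""
    out = bytearray(block)
    for i in range(len(out)):
        out[i] = (out[i] * 31 + tweak + i) & 0xFF
    return bytes(out)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Combine two equally sized buffers byte by byte."""
    return bytes(p ^ q for p, q in zip(a, b))

def mix_serial(x: bytes) -> bytes:
    """Run both passes one after the other on the same input."""
    first = toy_smix(x, 1)    # stands in for one mixing pass
    second = toy_smix(x, 2)   # stands in for the other mixing pass
    return xor_bytes(first, second)

def mix_parallel(x: bytes) -> bytes:
    """Run both passes concurrently; they share only the read-only input."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(toy_smix, x, 1)
        fb = pool.submit(toy_smix, x, 2)
        return xor_bytes(fa.result(), fb.result())
```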

T4rQu1N

Quick question, in my bat file, how do I specify different values for say -i or -w so that my two cards (which are different) have different settings?

-i 14,15?

-w 48, 72?

Kind regards,

T4

einkerl

"intensity" : "18,18,18",
"worksize" : "256,128,256",

specify in your .conf
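
If you would rather generate that fragment than type it, a small Python sketch follows. It only uses the two keys shown above, with one comma-separated value per card in device order. The helper name `build_conf` is made up, and a real .conf needs your pool settings and anything else your miner expects.

```python
import json

def build_conf(intensity, worksize):
    """Build a per-card config fragment; one list entry per GPU, in device order."""
    if len(intensity) != len(worksize):
        raise ValueError("need one intensity and one worksize per card")
    conf = {
        "intensity": ",".join(str(v) for v in intensity),
        "worksize": ",".join(str(v) for v in worksize),
    }
    return json.dumps(conf, indent=2)

# Two different cards, each with its own settings (values are examples only).
print(build_conf([14, 15], [64, 128]))
```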

ghostlander

neoscrypt_vliw.cl v2

It’s 19.5KH/s now on a HD6970. FastKDF and BLAKE2s have been cleaned up and optimised, memory requirements reduced.

OH MY GOD. I’ve been staring at this code for ages, and it only JUST NOW occurred to me that SMix() is parallelizable. Not the internals of SMix, of course, but the two calls to it…

Yeah, I’ve mentioned this in my white paper. Not sure if it’s of any use for mining.

I’ve never worked on 6xxx, but isn’t shuffle cheap? One shuffle for the permutation, keep it through all the salsa rounds + XOR ops, one shuffle to fix it. Seems like it’d be worth it - of course, unrolling will deliver a lot, probably.

It is, but that’s not what concerns me now. With FastKDF removed, the kernel gets reduced in size by ~60% and outputs 30KH/s. That’s a big overhead, but not critical, and I expected more out of ChaCha + Salsa. With only ChaCha enabled, it’s 58KH/s, and with only Salsa it’s 56KH/s. Scalar Salsa isn’t supposed to be about as fast as vectorised ChaCha. It’s clearly scalar because the AMD compiler isn’t really smart and the kernel size is about double the ChaCha-only size. Anyway, there is a huge bottleneck somewhere and it needs to be identified.
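
A quick back-of-the-envelope check on those figures, as a Python sketch. It assumes the stages run one after another so that per-hash times simply add, which is a simplification: each "only" build still carries whatever fixed overhead is left in the kernel, so the combined estimate is, if anything, on the low side.

```python
# Hash rates quoted above, in KH/s.
FULL = 19.5          # complete kernel
NO_FASTKDF = 30.0    # FastKDF removed
CHACHA_ONLY = 58.0   # only the ChaCha pass enabled
SALSA_ONLY = 56.0    # only the Salsa pass enabled

def ms_per_hash(khs):
    """Time per hash in milliseconds for a rate given in KH/s."""
    return 1.0 / khs

# If the two passes were strictly serial, their times would add up.
combined_estimate = 1.0 / (ms_per_hash(CHACHA_ONLY) + ms_per_hash(SALSA_ONLY))

# Share of the full kernel's time that disappears when FastKDF is removed.
fastkdf_share = (ms_per_hash(FULL) - ms_per_hash(NO_FASTKDF)) / ms_per_hash(FULL)

print(f"ChaCha + Salsa, serial estimate: {combined_estimate:.1f} KH/s")  # about 28.5
print(f"FastKDF share of total time: {fastkdf_share:.0%}")               # 35%
```

Under that model the 30KH/s measured without FastKDF is roughly what the two single-pass rates predict, and FastKDF accounts for about a third of the time per hash.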

Wolf0

I have a really hard time reading your style, but the code is pretty good! Don’t you think that bottleneck is waiting for global memory, though?

ghostlander

My first guess is that it runs out of private memory. It takes 512 bytes for block mixing + 800 bytes for FastKDF and BLAKE2s per kernel instance. That’s not including local variables, counters, etc. Scrypt consumes 3 times less private memory. It’s the opposite for global memory requirements, so you are not going to exceed them. Although the GCN cards report about the same amounts of local and constant memory (32KB + 64KB), they also have 32KB of L1 cache, which may help. Maybe they also have more private space (registers). Global memory is used for V space only. Not much activity there. Everything else runs in private/local space.
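
The private memory numbers above, worked through in a short Python sketch. It uses only the figures from this post. The Scrypt value is derived from the "3 times less" remark, so treat it as a rough estimate rather than a measurement.

```python
BLOCK_MIX_BYTES = 512   # block mixing state per kernel instance
KDF_BYTES = 800         # FastKDF + BLAKE2s state per kernel instance

# Total private bytes per work-item, before local variables and counters.
private_bytes = BLOCK_MIX_BYTES + KDF_BYTES

# The same amount expressed as 32-bit words, the unit GPU registers come in.
private_words = private_bytes // 4

# Rough Scrypt figure, taking "3 times less" literally.
scrypt_bytes_estimate = private_bytes / 3

print(private_bytes, private_words, round(scrypt_bytes_estimate))  # 1312 328 437
```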

Another guess is that there is something wrong with the miner itself, related to the scheduling of kernel threads. Increase intensity over 13 and the hash rate drops. Increase it even more and you see HW errors. Set it to 20 and it hangs. Scrypt can do 20, but it’s different. I probably need to start with a clean fork and add the NeoScrypt support myself. I have a few other ideas, but they also need work.
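
For a sense of scale on the intensity settings, a tiny Python sketch. It assumes the usual cgminer convention for Scrypt-family kernels that intensity I launches 2^I work-items per pass. That mapping is an assumption here and may not hold for every fork.

```python
def work_items(intensity):
    """Work-items per launch, assuming one work-item per 2**intensity slot."""
    return 1 << intensity

for i in (13, 20):
    print(f"intensity {i}: {work_items(i)} work-items")

# How many times more work-items intensity 20 asks for than intensity 13.
ratio = work_items(20) // work_items(13)
print(f"ratio: {ratio}x")  # 128x
```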

Wolf0

I feel stupid. For some reason, I was thinking of GCN cards while talking about 6xxx. Oops.

Wolf0

Preparing my GCN kernel for public release; cleaning code, removing stuff I tried that really sucked, like completely unrolled chacha/salsa, stuff like that. After that, I’ll package it up with SGMiner and it should be good to go. Should give results like this (NSFW): https://ottrbutt.com/miner/neoscryptwolf-11082014.png

Alpha Wolf

Those numbers look great, can’t wait to try this. :)

Does the version of SGMiner you’re building have xIntensity, or have you given any thought to using cgminer 3.7.3 Kalroth, which has xIntensity, for a build?

More info can be found here; that page states the new SGMiner 4.1 has xIntensity and might be a better choice. Personally I like cgminer better and have had better results with it than sgminer so far.

Wolf0

Doesn’t matter - kernel can be used with both.

EDIT: It can be used with any CGMiner/SGMiner that has Neoscrypt support, that is.

daimyo

WHOOOOOOOOOOOA!!!

Installed 14.9 drivers and got cgminer 3.8.7

The result:

hashrate jumped from 95 to 135!!! :))) Same temps!!!

Wolf0

Was that your first time using my fixed kernel on 14.9?

daimyo

Actually yes… I guess I am being a bit slow on those updates :D Good job! Thanks for your involvement.

Wolf0

No problem; you should be getting more hash soon!

slowhash

Let me step right up and personally thank you for the development you have done on this.

Post a btc address and I’ll send you a couple satoshi, or post a guncoin address and I’ll send you a couple thousand. ;)

xIIImaL

neoscrypt_vliw.cl v2

It’s 19.5KH/s now on a HD6970. FastKDF and BLAKE2s have been cleaned up and optimised, memory requirements reduced.

Yeah, I’ve mentioned this in my white paper. Not sure if it’s of any use for mining.

It is, but that’s not what concerns me now. With FastKDF removed, the kernel gets reduced in size by ~60% and outputs 30KH/s. That’s a big overhead, but not critical, and I expected more out of ChaCha + Salsa. With only ChaCha enabled, it’s 58KH/s, and with only Salsa it’s 56KH/s. Scalar Salsa isn’t supposed to be about as fast as vectorised ChaCha. It’s clearly scalar because the AMD compiler isn’t really smart and the kernel size is about double the ChaCha-only size. Anyway, there is a huge bottleneck somewhere and it needs to be identified.

It doesn’t work for me now. Cards in rig: 6950, 6870, 5870; miner 3.7.7b. Screen: